From pedropazzini at gmail.com  Mon Jan 2 15:44:25 2017
From: pedropazzini at gmail.com (Pedro Pazzini)
Date: Mon, 2 Jan 2017 18:44:25 -0200
Subject: [scikit-learn] KNeighborsClassifier and metric='precomputed'
Message-ID:

Hi all!

I'm trying to use a KNeighborsClassifier with precomputed metric. In its
predict method
(http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict)
it says the input should be:

"(n_query, n_indexed) if metric == 'precomputed'"

What is n_indexed?

Shouldn't the shape of the input in the predict method be
(n_query, n_query)?

How can I use the predict method after fitting the classifier with a
distance matrix?

Regards,
Pedro Pazzini
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Mon Jan 2 16:10:20 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 3 Jan 2017 08:10:20 +1100
Subject: [scikit-learn] KNeighborsClassifier and metric='precomputed'
In-Reply-To:
References:
Message-ID:

n_indexed means the number of samples in the X passed to fit. It needs to
be able to compare each prediction sample with each training sample.

On 3 January 2017 at 07:44, Pedro Pazzini wrote:

> Hi all!
>
> I'm trying to use a KNeighborsClassifier with precomputed metric. In its
> predict method
> (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict)
> it says the input should be:
>
> "(n_query, n_indexed) if metric == 'precomputed'"
>
> What is n_indexed?
>
> Shouldn't the shape of the input in the predict method be
> (n_query, n_query)?
>
> How can I use the predict method after fitting the classifier with a
> distance matrix?
>
> Regards,
> Pedro Pazzini
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pedropazzini at gmail.com  Tue Jan 3 10:33:22 2017
From: pedropazzini at gmail.com (Pedro Pazzini)
Date: Tue, 3 Jan 2017 13:33:22 -0200
Subject: [scikit-learn] KNeighborsClassifier and metric='precomputed'
In-Reply-To:
References:
Message-ID:

Joel,

Your explanation helped me understand it. Thanks!

2017-01-02 19:10 GMT-02:00 Joel Nothman :

> n_indexed means the number of samples in the X passed to fit. It needs to
> be able to compare each prediction sample with each training sample.
>
> On 3 January 2017 at 07:44, Pedro Pazzini wrote:
>
>> Hi all!
>>
>> I'm trying to use a KNeighborsClassifier with precomputed metric. In its
>> predict method
>> (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict)
>> it says the input should be:
>>
>> "(n_query, n_indexed) if metric == 'precomputed'"
>>
>> What is n_indexed?
>>
>> Shouldn't the shape of the input in the predict method be
>> (n_query, n_query)?
>>
>> How can I use the predict method after fitting the classifier with a
>> distance matrix?
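A minimal sketch of the shapes Joel describes; the data and variable names
below are invented for illustration and are not part of the original
exchange:

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(30, 4)             # 30 "indexed" (training) samples
y_train = rng.randint(0, 2, size=30)
X_query = rng.rand(5, 4)              # 5 query samples

knn = KNeighborsClassifier(n_neighbors=3, metric='precomputed')

# fit takes the square (n_indexed, n_indexed) matrix of training distances
knn.fit(pairwise_distances(X_train), y_train)

# predict takes the rectangular (n_query, n_indexed) matrix holding the
# distance from each query sample to each *training* sample
print(knn.predict(pairwise_distances(X_query, X_train)))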
>> >> Regards, >> Pedro Pazzini >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jan 3 12:19:33 2017 From: t3kcit at gmail.com (Andy) Date: Tue, 3 Jan 2017 12:19:33 -0500 Subject: [scikit-learn] KNeighborsClassifier and metric='precomputed' In-Reply-To: References: Message-ID: <1e4a624b-4f02-b621-f7c8-1cee4c2c6786@gmail.com> Should probably be called n_samples_train? On 01/02/2017 04:10 PM, Joel Nothman wrote: > n_indexed means the number of samples in the X passed to fit. It needs > to be able to compare each prediction sample with each training sample. > > On 3 January 2017 at 07:44, Pedro Pazzini > wrote: > > Hi all! > > I'm trying to use a KNeighborsClassifier with precomputed metric. > In it's predict method > (http://scikit-learn.org/stable/modules/generated/ > sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict) > it says the input should be: > > "(n_query, n_indexed) if metric == ?precomputed?" > > What is n_indexed? > > Shouldn't the shape of the input in the predict method be > (n_query,n_query)? > > How can I use the predict method after fitting the classifier with > a distance matrix? > > Regards, > Pedro Pazzini > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Tue Jan 3 12:31:44 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Tue, 03 Jan 2017 17:31:44 +0000 Subject: [scikit-learn] KNeighborsClassifier and metric='precomputed' In-Reply-To: <1e4a624b-4f02-b621-f7c8-1cee4c2c6786@gmail.com> References: <1e4a624b-4f02-b621-f7c8-1cee4c2c6786@gmail.com> Message-ID: That would be most helpful. Maybe also explain the logic? On Tue, 3 Jan 2017 at 18:19 Andy wrote: > Should probably be called n_samples_train? > > > On 01/02/2017 04:10 PM, Joel Nothman wrote: > > n_indexed means the number of samples in the X passed to fit. It needs to > be able to compare each prediction sample with each training sample. > > On 3 January 2017 at 07:44, Pedro Pazzini wrote: > > Hi all! > > I'm trying to use a KNeighborsClassifier with precomputed metric. In it's > predict method (http://scikit-learn.org/stable/modules/generated/sklearn > .neighbors.KNeighborsClassifier.html#sklearn.neighbors. > KNeighborsClassifier.predict) it says the input should be: > > "(n_query, n_indexed) if metric == ?precomputed?" > > What is n_indexed? > > Shouldn't the shape of the input in the predict method be > (n_query,n_query)? > > How can I use the predict method after fitting the classifier with a > distance matrix? 
> > Regards, > Pedro Pazzini > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedropazzini at gmail.com Tue Jan 3 13:09:57 2017 From: pedropazzini at gmail.com (Pedro Pazzini) Date: Tue, 3 Jan 2017 16:09:57 -0200 Subject: [scikit-learn] KNeighborsClassifier and metric='precomputed' In-Reply-To: References: <1e4a624b-4f02-b621-f7c8-1cee4c2c6786@gmail.com> Message-ID: If I understood, each row of the input matrix in the predict method contains the distances from a query point to each point in the training set. I think the reference should make this more clear. 2017-01-03 15:31 GMT-02:00 federico vaggi : > That would be most helpful. Maybe also explain the logic? > > On Tue, 3 Jan 2017 at 18:19 Andy wrote: >> >> Should probably be called n_samples_train? >> >> >> On 01/02/2017 04:10 PM, Joel Nothman wrote: >> >> n_indexed means the number of samples in the X passed to fit. It needs to >> be able to compare each prediction sample with each training sample. >> >> On 3 January 2017 at 07:44, Pedro Pazzini wrote: >>> >>> Hi all! >>> >>> I'm trying to use a KNeighborsClassifier with precomputed metric. In it's >>> predict method >>> (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict) >>> it says the input should be: >>> >>> "(n_query, n_indexed) if metric == ?precomputed?" >>> >>> What is n_indexed? >>> >>> Shouldn't the shape of the input in the predict method be >>> (n_query,n_query)? >>> >>> How can I use the predict method after fitting the classifier with a >>> distance matrix? >>> >>> Regards, >>> Pedro Pazzini >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From jonathan.taylor at stanford.edu Tue Jan 3 20:07:11 2017 From: jonathan.taylor at stanford.edu (Jonathan Taylor) Date: Tue, 3 Jan 2017 17:07:11 -0800 Subject: [scikit-learn] modifying CV score Message-ID: I'm looking for a simple way to get a small pipeline for choosing a parameter using a modification of CV for regression type problems. The modification is pretty simple, so, for squared-error or logistic deviance, it is a simple modification of the score of `Y` (binary labels) and `X.dot(beta)` (linear predictor). I've been trying to understand how to use sklearn for this as there is no need for me to rewrite the basic CV functions. 
I'd like to be able to use my own custom estimator (so I guess I just need
a subclass of BaseEstimator with a `fit` method with (X,y) signature?), as
well as my own modification of the score. I'd be happy to understand the
code for an estimator whose fit returns `np.zeros(X.shape[1])` and a given
scoring function like

def score(estimator, X_test, y_test):
    # estimator.parameters_ is just a zero vector for my estimator -- I
    # guess this is the way I should extract the linear predictor
    beta = estimator.parameters_
    linpred = X_test.dot(beta)
    # or maybe?  linpred = estimator.transform(X_test)
    return np.linalg.norm(y_test - linpred)

This would not be an interesting model, but it would help me understand
how things are evaluated in the CV loop.

I have read how to create a custom scorer in the docs but it does not seem
to describe what `estimator` will be inside the CV loop. I presume a
custom scorer will get called with values X_test and y_test and I suppose
estimator will be a model fit to X_train and y_train?

-- 
Jonathan Taylor
Dept. of Statistics
Sequoia Hall, 137
390 Serra Mall
Stanford, CA 94305
Tel:   650.723.9230
Fax:   650.725.8977
Web: http://www-stat.stanford.edu/~jtaylo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From olivier.grisel at ensta.org  Wed Jan 4 07:44:22 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Wed, 4 Jan 2017 13:44:22 +0100
Subject: [scikit-learn] modifying CV score
In-Reply-To:
References:
Message-ID:

You can indeed derive from BaseEstimator and implement fit, predict and
optionally score.

Here is the documentation for the expected estimator API:

http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects

As this is a linear regression model, you may also want to have a look at
the LinearModel and RegressorMixin base classes for inspiration:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L401

Note that the score function should always be "higher is better". The
explained variance ratio and negative mean squared error are valid scoring
functions for model selection in scikit-learn while raw MSE is not.

-- 
Olivier

From gael.varoquaux at normalesup.org  Wed Jan 4 07:50:42 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 4 Jan 2017 13:50:42 +0100
Subject: [scikit-learn] modifying CV score
In-Reply-To:
References:
Message-ID: <20170104125042.GG3264230@phare.normalesup.org>

> I've been trying to understand how to use sklearn for this as there is
> no need for me to rewrite the basic CV functions. I'd like to be able
> to use my own custom estimator (so I guess I just need a subclass of
> BaseEstimator with a `fit` method with (X,y) signature?), as well as my
> own modification of the score.

Be aware that scikit-learn assumes a few things about estimators. One of
them is that the __init__ should not do anything other than store the
parameters that it is given.

> I'd be happy to understand the code for an estimator whose fit returns
> `np.zeros(X.shape[1])`

Another assumption is that "fit" always returns self. The API that defines
a scikit-learn object is detailed here:
http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects

From jonathan.taylor at stanford.edu  Wed Jan 4 16:47:29 2017
From: jonathan.taylor at stanford.edu (Jonathan Taylor)
Date: Wed, 4 Jan 2017 13:47:29 -0800
Subject: [scikit-learn] modifying CV score
Message-ID:

(I think this is the right reply-to from a digest... If not, apologies)

Thanks for the pointers.
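A minimal sketch tying the advice in this thread together: a toy estimator
whose fit just stores a zero coefficient vector and returns self, plus a
scorer with the (estimator, X_test, y_test) signature that returns a
"higher is better" value. Everything here (names, data, the unused alpha
parameter) is invented for illustration.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV

class ZeroRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, alpha=1.0):
        self.alpha = alpha            # __init__ only stores its parameters

    def fit(self, X, y):
        self.coef_ = np.zeros(X.shape[1])
        return self                   # fit always returns self

    def predict(self, X):
        return X.dot(self.coef_)

def neg_error(estimator, X_test, y_test):
    # called with the estimator already fit on the training fold;
    # the minus sign makes "higher is better"
    return -np.linalg.norm(y_test - estimator.predict(X_test))

rng = np.random.RandomState(0)
X, y = rng.rand(40, 3), rng.rand(40)
search = GridSearchCV(ZeroRegressor(), {'alpha': [0.1, 1.0]},
                      scoring=neg_error, cv=5)
search.fit(X, y)
print(search.best_score_)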
From what I read on the API, I gather that for an estimator with a score method, inside GridSearchCV there will be pseudo-code like ... estimator.fit(X_train, y_train) scorer = estimator.score return scorer(X_test, y_test) For custom scores that are not methods of an estimator, I guess the `make_scorer` function returns a callable with the same signature as a score method of an estimator? -- Jonathan Taylor Dept. of Statistics Sequoia Hall, 137 390 Serra Mall Stanford, CA 94305 Tel: 650.723.9230 Fax: 650.725.8977 Web: http://www-stat.stanford.edu/~jtaylo -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Jan 4 22:06:43 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 5 Jan 2017 14:06:43 +1100 Subject: [scikit-learn] modifying CV score In-Reply-To: References: Message-ID: Well, it returns the equivalent of lambda estimator, X, y: estimator.score(X, y) On 5 January 2017 at 08:47, Jonathan Taylor wrote: > (Think this is right reply to from a digest... If not, apologies) > > Thanks for the pointers. From what I read on the API, I gather that for an > estimator with a score method, inside GridSearchCV there will be > pseudo-code like > > ... > estimator.fit(X_train, y_train) > scorer = estimator.score > return scorer(X_test, y_test) > > > For custom scores that are not methods of an estimator, I guess the > `make_scorer` function returns a callable with the same signature as a > score method of an estimator? > > -- > Jonathan Taylor > Dept. of Statistics > Sequoia Hall, 137 > 390 Serra Mall > Stanford, CA 94305 > Tel: 650.723.9230 <(650)%20723-9230> > Fax: 650.725.8977 <(650)%20725-8977> > Web: http://www-stat.stanford.edu/~jtaylo > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Sat Jan 7 11:15:54 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sat, 7 Jan 2017 17:15:54 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor Message-ID: Greetings, I have trained many MLPRegressors using different random_state value and estimated the R^2 using cross-validation. Now I want to combine the top 10% of them in how to get more accurate predictions. Is there a meta-estimator that can get as input a few precomputed MLPRegressors and give consensus predictions? Can the BaggingRegressor do this job using MLPRegressors as input? Thanks in advance for any hint. Thomas -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sat Jan 7 13:27:21 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sat, 7 Jan 2017 13:27:21 -0500 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: Message-ID: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> Hi, Thomas, the VotingClassifier can combine different models per majority voting amongst their predictions. Unfortunately, it refits the classifiers though (after cloning them). 
I think we implemented it this way to make it compatible to GridSearch and so forth. However, I have a version of the estimator that you can initialize with ?refit=False? to avoid refitting if it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers Best, Sebastian > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis wrote: > > Greetings, > > I have trained many MLPRegressors using different random_state value and estimated the R^2 using cross-validation. Now I want to combine the top 10% of them in how to get more accurate predictions. Is there a meta-estimator that can get as input a few precomputed MLPRegressors and give consensus predictions? Can the BaggingRegressor do this job using MLPRegressors as input? > > Thanks in advance for any hint. > Thomas > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Sat Jan 7 13:49:03 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sat, 7 Jan 2017 19:49:03 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> Message-ID: Hi Sebastian, Thanks, I will try it in another classification problem I have. However, this time I am using regressors not classifiers. On Jan 7, 2017 19:28, "Sebastian Raschka" wrote: > Hi, Thomas, > > the VotingClassifier can combine different models per majority voting > amongst their predictions. Unfortunately, it refits the classifiers though > (after cloning them). I think we implemented it this way to make it > compatible to GridSearch and so forth. However, I have a version of the > estimator that you can initialize with ?refit=False? to avoid refitting if > it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/ > EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers > > Best, > Sebastian > > > > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis > wrote: > > > > Greetings, > > > > I have trained many MLPRegressors using different random_state value and > estimated the R^2 using cross-validation. Now I want to combine the top 10% > of them in how to get more accurate predictions. Is there a meta-estimator > that can get as input a few precomputed MLPRegressors and give consensus > predictions? Can the BaggingRegressor do this job using MLPRegressors as > input? > > > > Thanks in advance for any hint. 
> > Thomas > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sat Jan 7 15:20:55 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sat, 7 Jan 2017 15:20:55 -0500 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> Message-ID: <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Hi, Thomas, sorry, I overread the regression part ? This would be a bit trickier, I am not sure what a good strategy for averaging regression outputs would be. However, if you just want to compute the average, you could do sth like np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) However, it may be better to use stacking, and use the output of r.predict(X) as meta features to train a model based on these? Best, Sebastian > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis wrote: > > Hi Sebastian, > > Thanks, I will try it in another classification problem I have. However, this time I am using regressors not classifiers. > > On Jan 7, 2017 19:28, "Sebastian Raschka" wrote: > Hi, Thomas, > > the VotingClassifier can combine different models per majority voting amongst their predictions. Unfortunately, it refits the classifiers though (after cloning them). I think we implemented it this way to make it compatible to GridSearch and so forth. However, I have a version of the estimator that you can initialize with ?refit=False? to avoid refitting if it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers > > Best, > Sebastian > > > > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis wrote: > > > > Greetings, > > > > I have trained many MLPRegressors using different random_state value and estimated the R^2 using cross-validation. Now I want to combine the top 10% of them in how to get more accurate predictions. Is there a meta-estimator that can get as input a few precomputed MLPRegressors and give consensus predictions? Can the BaggingRegressor do this job using MLPRegressors as input? > > > > Thanks in advance for any hint. 
> > Thomas > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ismaelfm_ at ciencias.unam.mx Sat Jan 7 15:52:10 2017 From: ismaelfm_ at ciencias.unam.mx (=?utf-8?Q?Jos=C3=A9_Ismael_Fern=C3=A1ndez_Mart=C3=ADnez?=) Date: Sat, 7 Jan 2017 14:52:10 -0600 Subject: [scikit-learn] Roc curve from multilabel classification has slope Message-ID: Hi, I have a multilabel classifier written in Keras from which I want to compute AUC and plot a ROC curve for every element classified from my test set. Everything seems fine, except that some elements have a roc curve that have a slope as follows: I don't know how to interpret the slope in such cases. Basically my workflow goes as follows, I have a pre-trained model, instance of Keras, and I have the features X and the binarized labels y, every element in y is an array of length 1000, as it is a multilabel classification problem each element in y might contain many 1s, indicating that the element belongs to multiples classes, so I used the built-in loss of binary_crossentropy and my outputs of the model prediction are score probailities. Then I plot the roc curve as follows. The predict method returns probabilities, as I'm using the functional api of keras. Does anyone knows why my roc curves looks like this? Ismael Sent from my iPhone -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image1.PNG Type: image/png Size: 132225 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image3.PNG Type: image/png Size: 42172 bytes Desc: not available URL: From tevang3 at gmail.com Sat Jan 7 16:36:37 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sat, 7 Jan 2017 22:36:37 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: On 7 January 2017 at 21:20, Sebastian Raschka wrote: > Hi, Thomas, > sorry, I overread the regression part ? > This would be a bit trickier, I am not sure what a good strategy for > averaging regression outputs would be. However, if you just want to compute > the average, you could do sth like > np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) > > However, it may be better to use stacking, and use the output of > r.predict(X) as meta features to train a model based on these? 
> ?Like to train an SVR to combine the predictions of the top 10% MLPRegressors using the same data that were used for training of the MLPRegressors? Wouldn't that lead to overfitting? ? > > Best, > Sebastian > > > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis > wrote: > > > > Hi Sebastian, > > > > Thanks, I will try it in another classification problem I have. However, > this time I am using regressors not classifiers. > > > > On Jan 7, 2017 19:28, "Sebastian Raschka" wrote: > > Hi, Thomas, > > > > the VotingClassifier can combine different models per majority voting > amongst their predictions. Unfortunately, it refits the classifiers though > (after cloning them). I think we implemented it this way to make it > compatible to GridSearch and so forth. However, I have a version of the > estimator that you can initialize with ?refit=False? to avoid refitting if > it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/ > EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers > > > > Best, > > Sebastian > > > > > > > > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis > wrote: > > > > > > Greetings, > > > > > > I have trained many MLPRegressors using different random_state value > and estimated the R^2 using cross-validation. Now I want to combine the top > 10% of them in how to get more accurate predictions. Is there a > meta-estimator that can get as input a few precomputed MLPRegressors and > give consensus predictions? Can the BaggingRegressor do this job using > MLPRegressors as input? > > > > > > Thanks in advance for any hint. > > > Thomas > > > > > > > > > -- > > > ====================================================================== > > > Thomas Evangelidis > > > Research Specialist > > > CEITEC - Central European Institute of Technology > > > Masaryk University > > > Kamenice 5/A35/1S081, > > > 62500 Brno, Czech Republic > > > > > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From joel.nothman at gmail.com Sat Jan 7 17:03:05 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 8 Jan 2017 09:03:05 +1100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: On 8 January 2017 at 08:36, Thomas Evangelidis wrote: > > > On 7 January 2017 at 21:20, Sebastian Raschka > wrote: > >> Hi, Thomas, >> sorry, I overread the regression part ? >> This would be a bit trickier, I am not sure what a good strategy for >> averaging regression outputs would be. However, if you just want to compute >> the average, you could do sth like >> np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) >> >> However, it may be better to use stacking, and use the output of >> r.predict(X) as meta features to train a model based on these? >> > > ?Like to train an SVR to combine the predictions of the top 10% > MLPRegressors using the same data that were used for training of the > MLPRegressors? Wouldn't that lead to overfitting? > You could certainly hold out a different data sample and that might indeed be valuable regularisation, but it's not obvious to me that this is substantially more prone to overfitting than just training a handful of MLPRegressors on the same data and having them vote by other means. There is no problem, in general, with overfitting, as long as your evaluation of an estimator's performance isn't biased towards the training set. We've not talked about overfitting. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jan 7 17:03:22 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 8 Jan 2017 09:03:22 +1100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: * > There is no problem, in general, with overfitting, as long as your > evaluation of an estimator's performance isn't biased towards the training > set. We've not talked about evaluation. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jan 7 17:04:14 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 8 Jan 2017 09:04:14 +1100 Subject: [scikit-learn] Roc curve from multilabel classification has slope In-Reply-To: References: Message-ID: predict method should not return probabilities in scikit-learn classifiers. predict_proba should. On 8 January 2017 at 07:52, Jos? Ismael Fern?ndez Mart?nez < ismaelfm_ at ciencias.unam.mx> wrote: > Hi, I have a multilabel classifier written in Keras from which I want to > compute AUC and plot a ROC curve for every element classified from my test > set. > > [image: image1.PNG] > > Everything seems fine, except that some elements have a roc curve that > have a slope as follows: > > [image: enter image description here] > I don't know how to interpret the > slope in such cases. 
> > Basically my workflow goes as follows, I have a pre-trained model, > instance of Keras, and I have the features X and the binarized labels y, > every element in y is an array of length 1000, as it is a multilabel > classification problem each element in y might contain many 1s, > indicating that the element belongs to multiples classes, so I used the > built-in loss of binary_crossentropy and my outputs of the model > prediction are score probailities. Then I plot the roc curve as follows. > > [image: image3.PNG] > > The predict method returns probabilities, as I'm using the functional api > of keras. > > Does anyone knows why my roc curves looks like this? > > > Ismael > > > Sent from my iPhone > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image1.PNG Type: image/png Size: 132225 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image3.PNG Type: image/png Size: 42172 bytes Desc: not available URL: From tevang3 at gmail.com Sat Jan 7 17:26:41 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sat, 7 Jan 2017 23:26:41 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Regarding the evaluation, I use the leave 20% out cross validation method. I cannot leave more out because my data sets are very small, between 30 and 40 observations, each one with 600 features. Is there a limit in the number of MLPRegressors I can combine with stacking considering my small data sets? On Jan 7, 2017 23:04, "Joel Nothman" wrote: > * > > >> There is no problem, in general, with overfitting, as long as your >> evaluation of an estimator's performance isn't biased towards the training >> set. We've not talked about evaluation. >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jan 7 18:04:08 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 7 Jan 2017 15:04:08 -0800 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: If you have such a small number of observations (with a much higher feature space) then why do you think you can accurately train not just a single MLP, but an ensemble of them without overfitting dramatically? On Sat, Jan 7, 2017 at 2:26 PM, Thomas Evangelidis wrote: > Regarding the evaluation, I use the leave 20% out cross validation method. > I cannot leave more out because my data sets are very small, between 30 and > 40 observations, each one with 600 features. Is there a limit in the number > of MLPRegressors I can combine with stacking considering my small data > sets? > > On Jan 7, 2017 23:04, "Joel Nothman" wrote: > >> * >> >> >>> There is no problem, in general, with overfitting, as long as your >>> evaluation of an estimator's performance isn't biased towards the training >>> set. 
We've not talked about evaluation.
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com  Sat Jan 7 19:01:55 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sun, 8 Jan 2017 01:01:55 +0100
Subject: [scikit-learn] meta-estimator for multiple MLPRegressor
In-Reply-To:
References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com>
	<450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com>
Message-ID:

On 8 January 2017 at 00:04, Jacob Schreiber wrote:

> If you have such a small number of observations (with a much higher
> feature space) then why do you think you can accurately train not just a
> single MLP, but an ensemble of them without overfitting dramatically?
>
>
Because the observations in the data set don't differ much between them.
To be more specific, the data set consists of a congeneric series of
organic molecules and the observation is their binding strength to a
target protein. The idea was to train predictors that can predict the
binding strength of new molecules that belong to the same congeneric
series. Therefore special care is taken to apply the predictors to the
right domain of applicability. According to the literature, the same
strategy has been followed in the past several times. The novelty of my
approach stems from other factors that are irrelevant to this thread.

-- 
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ismaelfm_ at ciencias.unam.mx  Sat Jan 7 19:32:49 2017
From: ismaelfm_ at ciencias.unam.mx (José Ismael Fernández Martínez)
Date: Sat, 7 Jan 2017 18:32:49 -0600
Subject: [scikit-learn] Roc curve from multilabel classification has slope
In-Reply-To:
References:
Message-ID: <6EEF6426-91D8-40D1-8FB8-E2F10D0327CA@ciencias.unam.mx>

But it is not a scikit-learn classifier, it is a Keras classifier which,
in the functional API, returns probabilities from predict.
What I don't understand is why my plot of the roc curve has a slope, since
I call roc_curve passing the actual label as y_true and the output of the
classifier (score probabilities) as y_score for every element tested.


Sent from my iPhone

> On Jan 7, 2017, at 4:04 PM, Joel Nothman wrote:
>
> predict method should not return probabilities in scikit-learn
> classifiers. predict_proba should.
>
>> On 8 January 2017 at 07:52, José Ismael Fernández Martínez wrote:
>> Hi, I have a multilabel classifier written in Keras from which I want
>> to compute AUC and plot a ROC curve for every element classified from
>> my test set.
>>
>>
>>
>> Everything seems fine, except that some elements have a roc curve that
>> have a slope as follows:
>> I don't know how to interpret the slope in such cases.
>> Basically my workflow goes as follows, I have a pre-trained model,
>> instance of Keras, and I have the features X and the binarized labels y,
>> every element in y is an array of length 1000, as it is a multilabel
>> classification problem each element in y might contain many 1s,
>> indicating that the element belongs to multiples classes, so I used the
>> built-in loss of binary_crossentropy and my outputs of the model
>> prediction are score probailities. Then I plot the roc curve as follows.
>>
>>
>>
>> The predict method returns probabilities, as I'm using the functional
>> api of keras.
>>
>> Does anyone knows why my roc curves looks like this?
>>
>> Ismael
>>
>>
>> Sent from my iPhone
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jmschreiber91 at gmail.com  Sat Jan 7 19:40:41 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Sat, 7 Jan 2017 16:40:41 -0800
Subject: [scikit-learn] meta-estimator for multiple MLPRegressor
In-Reply-To:
References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com>
	<450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com>
Message-ID:

This is an aside to what your original question was, but as someone who
has dealt with similar data in bioinformatics (gene expression,
specifically) I think you should tread -very- carefully if you have such a
small sample set and more dimensions than samples. MLPs are already prone
to overfit and both of those factors would make me inherently suspicious
of the results. This sounds like an easy way to trick yourself into
thinking you are making good predictions. Perhaps consider LASSO?

Back to the original question, it is true that using an SVR in a stacking
technique would add more parameters to your model, but it is likely an
insignificant amount when compared to the MLPs themselves. Alternatively
you may consider using LASSO with all of the MLPs (not just the top 10%)
so you can learn which ones yield useful features for a meta-estimator
instead of just selecting the top 10%.

On Sat, Jan 7, 2017 at 4:01 PM, Thomas Evangelidis wrote:

>
>
> On 8 January 2017 at 00:04, Jacob Schreiber wrote:
>
>> If you have such a small number of observations (with a much higher
>> feature space) then why do you think you can accurately train not just a
>> single MLP, but an ensemble of them without overfitting dramatically?
>>
>>
>>
> Because the observations in the data set don't differ much between them.
> To be more specific, the data set consists of a congeneric series of
> organic molecules and the observation is their binding strength to a
> target protein. The idea was to train predictors that can predict the
> binding strength of new molecules that belong to the same congeneric
> series. Therefore special care is taken to apply the predictors to the
> right domain of applicability. According to the literature, the same
> strategy has been followed in the past several times. The novelty of my
> approach stems from other factors that are irrelevant to this thread.
> > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jan 7 19:42:02 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 7 Jan 2017 16:42:02 -0800 Subject: [scikit-learn] Roc curve from multilabel classification has slope In-Reply-To: <6EEF6426-91D8-40D1-8FB8-E2F10D0327CA@ciencias.unam.mx> References: <6EEF6426-91D8-40D1-8FB8-E2F10D0327CA@ciencias.unam.mx> Message-ID: Slope usually means there are ties in your predictions. Check your dataset to see if you have repeated predicted values (possibly 1 or 0). On Sat, Jan 7, 2017 at 4:32 PM, Jos? Ismael Fern?ndez Mart?nez < ismaelfm_ at ciencias.unam.mx> wrote: > But is not a scikit-learn classifier, is a keras classifier which, in the > functional API, predict returns probabilities. > What I don't understand is why my plot of the roc curve has a slope, since > I call roc_curve passing the actual label as y_true and the output of the > classifier (score probabilities) as y_score for every element tested. > > > > Sent from my iPhone > On Jan 7, 2017, at 4:04 PM, Joel Nothman wrote: > > predict method should not return probabilities in scikit-learn > classifiers. predict_proba should. > > On 8 January 2017 at 07:52, Jos? Ismael Fern?ndez Mart?nez < > ismaelfm_ at ciencias.unam.mx> wrote: > >> Hi, I have a multilabel classifier written in Keras from which I want to >> compute AUC and plot a ROC curve for every element classified from my test >> set. >> >> >> >> Everything seems fine, except that some elements have a roc curve that >> have a slope as follows: >> >> [image: enter image description here] >> I don't know how to interpret the >> slope in such cases. >> >> Basically my workflow goes as follows, I have a pre-trained model, >> instance of Keras, and I have the features X and the binarized labels y, >> every element in y is an array of length 1000, as it is a multilabel >> classification problem each element in y might contain many 1s, >> indicating that the element belongs to multiples classes, so I used the >> built-in loss of binary_crossentropy and my outputs of the model >> prediction are score probailities. Then I plot the roc curve as follows. >> >> >> The predict method returns probabilities, as I'm using the functional api >> of keras. >> >> Does anyone knows why my roc curves looks like this? 
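A tiny sketch of the "ties" effect Jacob describes, with invented numbers:
when y_score takes only a few distinct values (for example hard 0/1
outputs), roc_curve returns only a few points, and joining them with
straight lines gives exactly this kind of long diagonal segment.

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0., 0., 1., 1., 1., 1., 0., 0.])   # heavily tied scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)   # only three points, so the plotted curve is mostly slope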
>> >> >> Ismael >> >> >> Sent from my iPhone >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Sun Jan 8 00:25:05 2017 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Sun, 8 Jan 2017 00:25:05 -0500 Subject: [scikit-learn] Jeff Levesque: Sample SVM / SVR dataset Message-ID: <7C01E03F-882B-45C2-A72B-54631180338F@yahoo.com> Hey guys, Im working on developing a web-interface, and programmatic api, to scikit-learn: - https://github.com/jeff1evesque/machine-learning However, I've only interfaced the SVM, and SVR classes. To be thorough, for development within git, I've created unit tests for the Travis CI. But, I made up some bogus datasets, in order to unit test the SVM, and SVR predictions: - dataset: https://github.com/jeff1evesque/machine-learning/tree/master/interface/static/data - unit tests: https://github.com/jeff1evesque/machine-learning/tree/master/test/live_server But, I'd prefer to have real data, so the computed prediction is more meaningful, instead of predicating on made up data. The corresponding unit tests I have, simply check if a prediction can be made for the supplied dataset. However, I'd like to check the prediction against a known, expected result, which is the motivation of having real meaningful dataset(s): - https://github.com/jeff1evesque/machine-learning/issues/2751 Does anyone have sample dataset(s) they have used for SVM, or SVR predictions? I'd like my unit tests to be somewhat interesting, yet more meaningful. Thank you, Jeff Levesque https://github.com/jeff1evesque From rth.yurchak at gmail.com Sun Jan 8 04:27:08 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Sun, 8 Jan 2017 10:27:08 +0100 Subject: [scikit-learn] Roc curve from multilabel classification has slope In-Reply-To: References: <6EEF6426-91D8-40D1-8FB8-E2F10D0327CA@ciencias.unam.mx> Message-ID: <587205EC.6060402@gmail.com> Jos?, I might be misunderstanding something, but wouldn't it make more sens to plot one ROC curve for every class in your result (using all samples at once), as opposed to plotting it for every training sample as you are doing now? Cf the example below, http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html Roman On 08/01/17 01:42, Jacob Schreiber wrote: > Slope usually means there are ties in your predictions. Check your > dataset to see if you have repeated predicted values (possibly 1 or 0). > > On Sat, Jan 7, 2017 at 4:32 PM, Jos? Ismael Fern?ndez Mart?nez > > wrote: > > But is not a scikit-learn classifier, is a keras classifier which, > in the functional API, predict returns probabilities. > What I don't understand is why my plot of the roc curve has a slope, > since I call roc_curve passing the actual label as y_true and the > output of the classifier (score probabilities) as y_score for every > element tested. > > > > Sent from my iPhone > On Jan 7, 2017, at 4:04 PM, Joel Nothman > wrote: > >> predict method should not return probabilities in scikit-learn >> classifiers. 
predict_proba should. >> >> On 8 January 2017 at 07:52, Jos? Ismael Fern?ndez Mart?nez >> > >> wrote: >> >> Hi, I have a multilabel classifier written in Keras from which >> I want to compute AUC and plot a ROC curve for every element >> classified from my test set. >> >> >> >> Everything seems fine, except that some elements have a roc >> curve that have a slope as follows: >> >> enter image description here >> I don't know how to >> interpret the slope in such cases. >> >> Basically my workflow goes as follows, I have a >> pre-trained |model|, instance of Keras, and I have the >> features |X| and the binarized labels |y|, every element >> in |y| is an array of length 1000, as it is a multilabel >> classification problem each element in |y| might contain many >> 1s, indicating that the element belongs to multiples classes, >> so I used the built-in loss of |binary_crossentropy| and my >> outputs of the model prediction are score probailities. Then I >> plot the roc curve as follows. >> >> >> The predict method returns probabilities, as I'm using the >> functional api of keras. >> >> Does anyone knows why my roc curves looks like this? >> >> >> Ismael >> >> >> >> Sent from my iPhone >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From se.raschka at gmail.com Sun Jan 8 05:53:53 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 8 Jan 2017 05:53:53 -0500 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: > Like to train an SVR to combine the predictions of the top 10% MLPRegressors using the same data that were used for training of the MLPRegressors? Wouldn't that lead to overfitting? It could, but you don't need to use the same data that you used for training to fit the meta estimator. Like it is commonly done in stacking with cross validation, you can train the mlps on training folds and pass predictions from a test fold to the meta estimator but then you'd have to retrain your mlps and it sounded like you wanted to avoid that. I am currently on mobile and only browsed through the thread briefly, but I agree with others that it may sound like your model(s) may have too much capacity for such a small dataset -- can be tricky to fit the parameters without overfitting. In any case, if you to do the stacking, I'd probably insert a k-fold cv between the mlps and the meta estimator. However I'd definitely also recommend simpler models als alternative. Best, Sebastian > On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis wrote: > > > >> On 7 January 2017 at 21:20, Sebastian Raschka wrote: >> Hi, Thomas, >> sorry, I overread the regression part ? >> This would be a bit trickier, I am not sure what a good strategy for averaging regression outputs would be. 
However, if you just want to compute the average, you could do sth like >> np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) >> >> However, it may be better to use stacking, and use the output of r.predict(X) as meta features to train a model based on these? > > ?Like to train an SVR to combine the predictions of the top 10% MLPRegressors using the same data that were used for training of the MLPRegressors? Wouldn't that lead to overfitting? > ? >> >> Best, >> Sebastian >> >> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis wrote: >> > >> > Hi Sebastian, >> > >> > Thanks, I will try it in another classification problem I have. However, this time I am using regressors not classifiers. >> > >> > On Jan 7, 2017 19:28, "Sebastian Raschka" wrote: >> > Hi, Thomas, >> > >> > the VotingClassifier can combine different models per majority voting amongst their predictions. Unfortunately, it refits the classifiers though (after cloning them). I think we implemented it this way to make it compatible to GridSearch and so forth. However, I have a version of the estimator that you can initialize with ?refit=False? to avoid refitting if it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers >> > >> > Best, >> > Sebastian >> > >> > >> > >> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis wrote: >> > > >> > > Greetings, >> > > >> > > I have trained many MLPRegressors using different random_state value and estimated the R^2 using cross-validation. Now I want to combine the top 10% of them in how to get more accurate predictions. Is there a meta-estimator that can get as input a few precomputed MLPRegressors and give consensus predictions? Can the BaggingRegressor do this job using MLPRegressors as input? >> > > >> > > Thanks in advance for any hint. 
>> > > Thomas
>> > >
>> > >
>> > > --
>> > > ======================================================================
>> > > Thomas Evangelidis
>> > > Research Specialist
>> > > CEITEC - Central European Institute of Technology
>> > > Masaryk University
>> > > Kamenice 5/A35/1S081,
>> > > 62500 Brno, Czech Republic
>> > >
>> > > email: tevang at pharm.uoa.gr
>> > > tevang3 at gmail.com
>> > >
>> > > website: https://sites.google.com/site/thomasevangelidishomepage/
>> > >
>> > >
>> > > _______________________________________________
>> > > scikit-learn mailing list
>> > > scikit-learn at python.org
>> > > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
> --
>
> ======================================================================
>
> Thomas Evangelidis
>
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: tevang at pharm.uoa.gr
>
> tevang3 at gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tevang3 at gmail.com  Sun Jan 8 06:42:09 2017
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sun, 8 Jan 2017 12:42:09 +0100
Subject: [scikit-learn] meta-estimator for multiple MLPRegressor
In-Reply-To:
References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com>
	<450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com>
Message-ID:

Sebastian and Jacob,

Regarding overfitting, Lasso, Ridge regression and ElasticNet have poor
performance on my data. MLPRegressors are way superior. On another note,
the MLPRegressor class has some parameters to control overfitting, like
controlling the alpha parameter for the L2 regularization (maybe setting
it to a high value?) or the number of neurons in the hidden layers
(lowering the hidden_layer_sizes?) or even "early_stopping=True". Wouldn't
these be sufficient to be on the safe side?

Once more I want to highlight something I wrote previously but might have
been overlooked. The resulting MLPRegressors will be applied to new
datasets that *ARE VERY SIMILAR TO THE TRAINING DATA*. In other words the
application of the models will be strictly confined to their applicability
domain. Wouldn't that be sufficient to not worry about model overfitting
too much?

On 8 January 2017 at 11:53, Sebastian Raschka wrote:

> Like to train an SVR to combine the predictions of the top 10%
> MLPRegressors using the same data that were used for training of the
> MLPRegressors? Wouldn't that lead to overfitting?
>
>
> It could, but you don't need to use the same data that you used for
> training to fit the meta estimator.
Like it is commonly done in stacking > with cross validation, you can train the mlps on training folds and pass > predictions from a test fold to the meta estimator but then you'd have to > retrain your mlps and it sounded like you wanted to avoid that. > > I am currently on mobile and only browsed through the thread briefly, but > I agree with others that it may sound like your model(s) may have too much > capacity for such a small dataset -- can be tricky to fit the parameters > without overfitting. In any case, if you to do the stacking, I'd probably > insert a k-fold cv between the mlps and the meta estimator. However I'd > definitely also recommend simpler models als > alternative. > > Best, > Sebastian > > On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis wrote: > > > > On 7 January 2017 at 21:20, Sebastian Raschka > wrote: > >> Hi, Thomas, >> sorry, I overread the regression part ? >> This would be a bit trickier, I am not sure what a good strategy for >> averaging regression outputs would be. However, if you just want to compute >> the average, you could do sth like >> np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) >> >> However, it may be better to use stacking, and use the output of >> r.predict(X) as meta features to train a model based on these? >> > > ?Like to train an SVR to combine the predictions of the top 10% > MLPRegressors using the same data that were used for training of the > MLPRegressors? Wouldn't that lead to overfitting? > ? > > >> >> Best, >> Sebastian >> >> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis >> wrote: >> > >> > Hi Sebastian, >> > >> > Thanks, I will try it in another classification problem I have. >> However, this time I am using regressors not classifiers. >> > >> > On Jan 7, 2017 19:28, "Sebastian Raschka" wrote: >> > Hi, Thomas, >> > >> > the VotingClassifier can combine different models per majority voting >> amongst their predictions. Unfortunately, it refits the classifiers though >> (after cloning them). I think we implemented it this way to make it >> compatible to GridSearch and so forth. However, I have a version of the >> estimator that you can initialize with ?refit=False? to avoid refitting if >> it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/Ensembl >> eVoteClassifier/#example-5-using-pre-fitted-classifiers >> > >> > Best, >> > Sebastian >> > >> > >> > >> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis >> wrote: >> > > >> > > Greetings, >> > > >> > > I have trained many MLPRegressors using different random_state value >> and estimated the R^2 using cross-validation. Now I want to combine the top >> 10% of them in how to get more accurate predictions. Is there a >> meta-estimator that can get as input a few precomputed MLPRegressors and >> give consensus predictions? Can the BaggingRegressor do this job using >> MLPRegressors as input? >> > > >> > > Thanks in advance for any hint. 
>> > > Thomas >> > > >> > > >> > > -- >> > > ============================================================ >> ========== >> > > Thomas Evangelidis >> > > Research Specialist >> > > CEITEC - Central European Institute of Technology >> > > Masaryk University >> > > Kamenice 5/A35/1S081, >> > > 62500 Brno, Czech Republic >> > > >> > > email: tevang at pharm.uoa.gr >> > > tevang3 at gmail.com >> > > >> > > website: https://sites.google.com/site/thomasevangelidishomepage/ >> > > >> > > >> > > _______________________________________________ >> > > scikit-learn mailing list >> > > scikit-learn at python.org >> > > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Jan 8 23:08:53 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 9 Jan 2017 15:08:53 +1100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Btw, I may have been unclear in the discussion of overfitting. For *training* the meta-estimator in stacking, it's standard to do something like cross_val_predict on your training set to produce its input features. On 8 January 2017 at 22:42, Thomas Evangelidis wrote: > Sebastian and Jacob, > > Regarding overfitting, Lasso, Ridge regression and ElasticNet have poor > performance on my data. MLPregressors are way superior. On an other note, > MLPregressor class has some methods to contol overfitting, like controling > the alpha parameter for the L2 regularization (maybe setting it to a high > value?) or the number of neurons in the hidden layers (lowering the hidden_layer_sizes?) > or even "early_stopping=True". Wouldn't these be sufficient to be on the > safe side. 
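To make the cross_val_predict idea concrete, below is a rough, self-contained sketch of such a stacking setup. The synthetic data, the two MLP base models and the SVR meta-estimator are stand-ins chosen to mirror the discussion in this thread, not a recommended recipe.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=60, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [MLPRegressor(hidden_layer_sizes=(10,), alpha=10, max_iter=1000,
                            random_state=s) for s in (0, 1)]

# out-of-fold predictions become the meta-estimator's training features,
# so the SVR never sees predictions made on data the MLPs were fitted on
meta_train = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5)
                              for m in base_models])
meta_model = SVR().fit(meta_train, y_train)

# for new data, the base models are refitted on the full training set first
for m in base_models:
    m.fit(X_train, y_train)
meta_test = np.column_stack([m.predict(X_test) for m in base_models])
print(meta_model.score(meta_test, y_test))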
> > Once more I want to highlight something I wrote previously but might have > been overlooked. The resulting MLPRegressors will be applied to new > datasets that *ARE VERY SIMILAR TO THE TRAINING DATA*. In other words the > application of the models will be strictly confined to their applicability > domain. Wouldn't that be sufficient to not worry about model overfitting > too much? > > > > > > On 8 January 2017 at 11:53, Sebastian Raschka > wrote: > >> Like to train an SVR to combine the predictions of the top 10% >> MLPRegressors using the same data that were used for training of the >> MLPRegressors? Wouldn't that lead to overfitting? >> >> >> It could, but you don't need to use the same data that you used for >> training to fit the meta estimator. Like it is commonly done in stacking >> with cross validation, you can train the mlps on training folds and pass >> predictions from a test fold to the meta estimator but then you'd have to >> retrain your mlps and it sounded like you wanted to avoid that. >> >> I am currently on mobile and only browsed through the thread briefly, but >> I agree with others that it may sound like your model(s) may have too much >> capacity for such a small dataset -- can be tricky to fit the parameters >> without overfitting. In any case, if you to do the stacking, I'd probably >> insert a k-fold cv between the mlps and the meta estimator. However I'd >> definitely also recommend simpler models als >> alternative. >> >> Best, >> Sebastian >> >> On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis wrote: >> >> >> >> On 7 January 2017 at 21:20, Sebastian Raschka >> wrote: >> >>> Hi, Thomas, >>> sorry, I overread the regression part ? >>> This would be a bit trickier, I am not sure what a good strategy for >>> averaging regression outputs would be. However, if you just want to compute >>> the average, you could do sth like >>> np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) >>> >>> However, it may be better to use stacking, and use the output of >>> r.predict(X) as meta features to train a model based on these? >>> >> >> ?Like to train an SVR to combine the predictions of the top 10% >> MLPRegressors using the same data that were used for training of the >> MLPRegressors? Wouldn't that lead to overfitting? >> ? >> >> >>> >>> Best, >>> Sebastian >>> >>> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis >>> wrote: >>> > >>> > Hi Sebastian, >>> > >>> > Thanks, I will try it in another classification problem I have. >>> However, this time I am using regressors not classifiers. >>> > >>> > On Jan 7, 2017 19:28, "Sebastian Raschka" >>> wrote: >>> > Hi, Thomas, >>> > >>> > the VotingClassifier can combine different models per majority voting >>> amongst their predictions. Unfortunately, it refits the classifiers though >>> (after cloning them). I think we implemented it this way to make it >>> compatible to GridSearch and so forth. However, I have a version of the >>> estimator that you can initialize with ?refit=False? to avoid refitting if >>> it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/Ensembl >>> eVoteClassifier/#example-5-using-pre-fitted-classifiers >>> > >>> > Best, >>> > Sebastian >>> > >>> > >>> > >>> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis >>> wrote: >>> > > >>> > > Greetings, >>> > > >>> > > I have trained many MLPRegressors using different random_state value >>> and estimated the R^2 using cross-validation. Now I want to combine the top >>> 10% of them in how to get more accurate predictions. 
Is there a >>> meta-estimator that can get as input a few precomputed MLPRegressors and >>> give consensus predictions? Can the BaggingRegressor do this job using >>> MLPRegressors as input? >>> > > >>> > > Thanks in advance for any hint. >>> > > Thomas >>> > > >>> > > >>> > > -- >>> > > ============================================================ >>> ========== >>> > > Thomas Evangelidis >>> > > Research Specialist >>> > > CEITEC - Central European Institute of Technology >>> > > Masaryk University >>> > > Kamenice 5/A35/1S081, >>> > > 62500 Brno, Czech Republic >>> > > >>> > > email: tevang at pharm.uoa.gr >>> > > tevang3 at gmail.com >>> > > >>> > > website: https://sites.google.com/site/thomasevangelidishomepage/ >>> > > >>> > > >>> > > _______________________________________________ >>> > > scikit-learn mailing list >>> > > scikit-learn at python.org >>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>> > >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Jan 9 04:48:41 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 9 Jan 2017 10:48:41 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release Message-ID: Hi all, I think we should release 0.18.2 to get some important fixes and make it easy to release Python 3.6 wheel package for all the operating systems using the automated procedure. I identified a couple of PR to backport to 0.18.X to prepare the 0.18.2 release. Are there any other important recently fixed bugfs people would like to see backported in this release? 
https://github.com/scikit-learn/scikit-learn/milestone/23?closed=1 Best, -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel From joel.nothman at gmail.com Mon Jan 9 06:04:05 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 9 Jan 2017 22:04:05 +1100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: In terms of the bug fixes listed in the change-log, most seem non-urgent. I would consider pulling across #7954, #8006, #8087, #7872, #7983. But I also wonder whether we'd be better off sprinting towards a small 0.19 release. On 9 January 2017 at 20:48, Olivier Grisel wrote: > Hi all, > > I think we should release 0.18.2 to get some important fixes and make > it easy to release Python 3.6 wheel package for all the operating > systems using the automated procedure. > > I identified a couple of PR to backport to 0.18.X to prepare the > 0.18.2 release. Are there any other important recently fixed bugfs > people would like to see backported in this release? > > https://github.com/scikit-learn/scikit-learn/milestone/23?closed=1 > > Best, > > -- > Olivier > http://twitter.com/ogrisel - http://github.com/ogrisel > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Jan 9 09:43:10 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 9 Jan 2017 15:43:10 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: In retrospect, making a small 0.19 release is probably a good idea. I would like to get https://github.com/scikit-learn/scikit-learn/pull/8002 in before cutting the 0.19.X branch. -- Olivier Grisel From ragvrv at gmail.com Mon Jan 9 10:06:35 2017 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 9 Jan 2017 16:06:35 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: I think it would be nice to have 0.19 by April. We'd have 3 more months and we can frame some roadmap towards it? On Mon, Jan 9, 2017 at 3:43 PM, Olivier Grisel wrote: > In retrospect, making a small 0.19 release is probably a good idea. > > I would like to get > https://github.com/scikit-learn/scikit-learn/pull/8002 in before > cutting the 0.19.X branch. > > -- > Olivier Grisel > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Mon Jan 9 10:07:53 2017 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 9 Jan 2017 16:07:53 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: (So we can get back to the one release per 4 month cycle?) On Mon, Jan 9, 2017 at 4:06 PM, Raghav R V wrote: > I think it would be nice to have 0.19 by April. We'd have 3 more months > and we can frame some roadmap towards it? > > On Mon, Jan 9, 2017 at 3:43 PM, Olivier Grisel > wrote: > >> In retrospect, making a small 0.19 release is probably a good idea. >> >> I would like to get >> https://github.com/scikit-learn/scikit-learn/pull/8002 in before >> cutting the 0.19.X branch. 
>> >> -- >> Olivier Grisel >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > Raghav RV > https://github.com/raghavrv > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Jan 9 10:12:08 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 9 Jan 2017 16:12:08 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: I would rather like to get it out before April ideally and instead of setting up a roadmap I would rather just identify bugs that are blockers and fix only those and don't wait for any feature before cutting 0.19.X. -- Olivier From gael.varoquaux at normalesup.org Mon Jan 9 10:15:46 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 9 Jan 2017 16:15:46 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: <20170109151546.GM2802991@phare.normalesup.org> > instead of setting up a roadmap I would rather just identify bugs that > are blockers and fix only those and don't wait for any feature before > cutting 0.19.X. +1 From raga.markely at gmail.com Mon Jan 9 11:29:28 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 9 Jan 2017 11:29:28 -0500 Subject: [scikit-learn] Generalized Discriminant Analysis with Kernel Message-ID: Hello, I wonder if scikit-learn has implementation for generalized discriminant analysis using kernel approach? http://www.kernel-machines.org/papers/upload_21840_GDA.pdf I did some search, but couldn't find. Thank you, Raga -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Jan 9 13:21:59 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 9 Jan 2017 10:21:59 -0800 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Thomas, it can be difficult to fine tune L1/L2 regularization in the case where n_parameters >>> n_samples ~and~ n_features >> n_samples. If your samples are very similar to the training data, why are simpler models not working well? On Sun, Jan 8, 2017 at 8:08 PM, Joel Nothman wrote: > Btw, I may have been unclear in the discussion of overfitting. For > *training* the meta-estimator in stacking, it's standard to do something > like cross_val_predict on your training set to produce its input features. > > On 8 January 2017 at 22:42, Thomas Evangelidis wrote: > >> Sebastian and Jacob, >> >> Regarding overfitting, Lasso, Ridge regression and ElasticNet have poor >> performance on my data. MLPregressors are way superior. On an other note, >> MLPregressor class has some methods to contol overfitting, like controling >> the alpha parameter for the L2 regularization (maybe setting it to a high >> value?) or the number of neurons in the hidden layers (lowering the hidden_layer_sizes?) >> or even "early_stopping=True". Wouldn't these be sufficient to be on the >> safe side. >> >> Once more I want to highlight something I wrote previously but might have >> been overlooked. The resulting MLPRegressors will be applied to new >> datasets that *ARE VERY SIMILAR TO THE TRAINING DATA*. 
In other words >> the application of the models will be strictly confined to their >> applicability domain. Wouldn't that be sufficient to not worry about model >> overfitting too much? >> >> >> >> >> >> On 8 January 2017 at 11:53, Sebastian Raschka >> wrote: >> >>> Like to train an SVR to combine the predictions of the top 10% >>> MLPRegressors using the same data that were used for training of the >>> MLPRegressors? Wouldn't that lead to overfitting? >>> >>> >>> It could, but you don't need to use the same data that you used for >>> training to fit the meta estimator. Like it is commonly done in stacking >>> with cross validation, you can train the mlps on training folds and pass >>> predictions from a test fold to the meta estimator but then you'd have to >>> retrain your mlps and it sounded like you wanted to avoid that. >>> >>> I am currently on mobile and only browsed through the thread briefly, >>> but I agree with others that it may sound like your model(s) may have too >>> much capacity for such a small dataset -- can be tricky to fit the >>> parameters without overfitting. In any case, if you to do the stacking, I'd >>> probably insert a k-fold cv between the mlps and the meta estimator. >>> However I'd definitely also recommend simpler models als >>> alternative. >>> >>> Best, >>> Sebastian >>> >>> On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis >>> wrote: >>> >>> >>> >>> On 7 January 2017 at 21:20, Sebastian Raschka >>> wrote: >>> >>>> Hi, Thomas, >>>> sorry, I overread the regression part ? >>>> This would be a bit trickier, I am not sure what a good strategy for >>>> averaging regression outputs would be. However, if you just want to compute >>>> the average, you could do sth like >>>> np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) >>>> >>>> However, it may be better to use stacking, and use the output of >>>> r.predict(X) as meta features to train a model based on these? >>>> >>> >>> ?Like to train an SVR to combine the predictions of the top 10% >>> MLPRegressors using the same data that were used for training of the >>> MLPRegressors? Wouldn't that lead to overfitting? >>> ? >>> >>> >>>> >>>> Best, >>>> Sebastian >>>> >>>> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis >>>> wrote: >>>> > >>>> > Hi Sebastian, >>>> > >>>> > Thanks, I will try it in another classification problem I have. >>>> However, this time I am using regressors not classifiers. >>>> > >>>> > On Jan 7, 2017 19:28, "Sebastian Raschka" >>>> wrote: >>>> > Hi, Thomas, >>>> > >>>> > the VotingClassifier can combine different models per majority voting >>>> amongst their predictions. Unfortunately, it refits the classifiers though >>>> (after cloning them). I think we implemented it this way to make it >>>> compatible to GridSearch and so forth. However, I have a version of the >>>> estimator that you can initialize with ?refit=False? to avoid refitting if >>>> it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/Ensembl >>>> eVoteClassifier/#example-5-using-pre-fitted-classifiers >>>> > >>>> > Best, >>>> > Sebastian >>>> > >>>> > >>>> > >>>> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis >>>> wrote: >>>> > > >>>> > > Greetings, >>>> > > >>>> > > I have trained many MLPRegressors using different random_state >>>> value and estimated the R^2 using cross-validation. Now I want to combine >>>> the top 10% of them in how to get more accurate predictions. Is there a >>>> meta-estimator that can get as input a few precomputed MLPRegressors and >>>> give consensus predictions? 
Can the BaggingRegressor do this job using >>>> MLPRegressors as input? >>>> > > >>>> > > Thanks in advance for any hint. >>>> > > Thomas >>>> > > >>>> > > >>>> > > -- >>>> > > ============================================================ >>>> ========== >>>> > > Thomas Evangelidis >>>> > > Research Specialist >>>> > > CEITEC - Central European Institute of Technology >>>> > > Masaryk University >>>> > > Kamenice 5/A35/1S081, >>>> > > 62500 Brno, Czech Republic >>>> > > >>>> > > email: tevang at pharm.uoa.gr >>>> > > tevang3 at gmail.com >>>> > > >>>> > > website: https://sites.google.com/site/thomasevangelidishomepage/ >>>> > > >>>> > > >>>> > > _______________________________________________ >>>> > > scikit-learn mailing list >>>> > > scikit-learn at python.org >>>> > > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >>> >>> -- >>> >>> ====================================================================== >>> >>> Thomas Evangelidis >>> >>> Research Specialist >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/1S081, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> tevang3 at gmail.com >>> >>> >>> website: https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From smith_r at ligo.caltech.edu Mon Jan 9 14:34:26 2017 From: smith_r at ligo.caltech.edu (Rory Smith) Date: Mon, 9 Jan 2017 11:34:26 -0800 Subject: [scikit-learn] Complex variables in Gaussian mixture models? Message-ID: <1A6E40A6-5019-44F8-BF56-EC382E8908FD@ligo.caltech.edu> Hi All, I?d like to set up a GMM using mixture.BayesianGaussianMixture to model a probability density of complex random variables (the learned means and covariances should also be complex valued). 
I wasn?t able to see any mention of how to handle complex variables in the documentation so I?m curious if it?s possible in the current implementation. I tried the obvious thing of first generating a 1D array of complex random numbers, but I see these warning when I try and fit the array X using dpgmm = mixture.BayesianGaussianMixture(n_components=4, covariance_type='full', n_init=1).fit(X) ~/miniconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:382: ComplexWarning: Casting complex values to real discards the imaginary part array = np.array(array, dtype=dtype, order=order, copy=copy) And as might be expected from the warning, the learned means are real. Any advice on this problem would be greatly appreciated! Best, Rory -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Mon Jan 9 15:42:24 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Mon, 09 Jan 2017 20:42:24 +0000 Subject: [scikit-learn] Complex variables in Gaussian mixture models? In-Reply-To: <1A6E40A6-5019-44F8-BF56-EC382E8908FD@ligo.caltech.edu> References: <1A6E40A6-5019-44F8-BF56-EC382E8908FD@ligo.caltech.edu> Message-ID: Probably not the most principled way to handle it, but: can't you treat 1 dimensional complex numbers as 2 dimensional real numbers, and then try to cluster those with the GMM? On Mon, 9 Jan 2017 at 20:34 Rory Smith wrote: > Hi All, > > I?d like to set up a GMM using mixture.BayesianGaussianMixture to model a > probability density of complex random variables (the learned means and > covariances should also be complex valued). I wasn?t able to see any > mention of how to handle complex variables in the documentation so I?m > curious if it?s possible in the current implementation. > I tried the obvious thing of first generating a 1D array of complex > random numbers, but I see these warning when I try and fit the array X > using > > dpgmm = mixture.BayesianGaussianMixture(n_components=4, > covariance_type='full', n_init=1 > ).fit(X) > > ~/miniconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:382: > ComplexWarning: Casting complex values to real discards the imaginary part > array = np.array(array, dtype=dtype, order=order, copy=copy) > > > And as might be expected from the warning, the learned means are real. > > Any advice on this problem would be greatly appreciated! > > Best, > Rory > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Jan 9 15:43:23 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 9 Jan 2017 12:43:23 -0800 Subject: [scikit-learn] Complex variables in Gaussian mixture models? In-Reply-To: <1A6E40A6-5019-44F8-BF56-EC382E8908FD@ligo.caltech.edu> References: <1A6E40A6-5019-44F8-BF56-EC382E8908FD@ligo.caltech.edu> Message-ID: I'm not too familiar with how complex values are traditionally treated, but is it possible to make the complex component a real valued component and treat it just as having twice as many features? On Mon, Jan 9, 2017 at 11:34 AM, Rory Smith wrote: > Hi All, > > I?d like to set up a GMM using mixture.BayesianGaussianMixture to model a > probability density of complex random variables (the learned means and > covariances should also be complex valued). 
I wasn't able to see any > mention of how to handle complex variables in the documentation so I'm > curious if it's possible in the current implementation. > I tried the obvious thing of first generating a 1D array of complex > random numbers, but I see these warning when I try and fit the array X > using > > dpgmm = mixture.BayesianGaussianMixture(n_components=4, > covariance_type='full', n_init=1 > ).fit(X) > > ~/miniconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:382: > ComplexWarning: Casting complex values to real discards the imaginary part > array = np.array(array, dtype=dtype, order=order, copy=copy) > > > And as might be expected from the warning, the learned means are real. > > Any advice on this problem would be greatly appreciated! > > Best, > Rory > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From se.raschka at gmail.com Mon Jan 9 17:55:57 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 9 Jan 2017 17:55:57 -0500
Subject: Re: [scikit-learn] meta-estimator for multiple MLPRegressor
In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com>
Message-ID:
> Once more I want to highlight something I wrote previously but might have been overlooked. The resulting MLPRegressors will be applied to new datasets that ARE VERY SIMILAR TO THE TRAINING DATA. In other words the application of the models will be strictly confined to their applicability domain. Wouldn't that be sufficient to not worry about model overfitting too much?
If you have a very small dataset and a very large number of features, I'd always be careful with / avoid models that have a high capacity. However, it is really hard to answer this question because we don't know much about your training and evaluation approach. If you didn't do much hyperparameter tuning and cross-validation for model selection, and if you set aside a sufficiently large portion as an independent test set that you only looked at once and get a good performance on that, you may be lucky and a complex MLP may generalize well. However, like others said, it's really hard to get an MLP right (not memorizing the training data) if n_samples is small and n_features is large. And for n_features > n_samples, that may be very, very hard.
> like controling the alpha parameter for the L2 regularization (maybe setting it to a high value?) or the number of neurons in the hidden layers (lowering the hidden_layer_sizes?) or even "early_stopping=True"
As a rule of thumb, the higher the capacity, the higher the degree/chance of overfitting. So yes, this could help a little bit. You probably also want to try dropout instead of L2 (or in addition), which usually has a stronger effect on regularization (esp. if you have a very large set of redundant features). Can't remember the exact paper, but I read about an approach where the authors set a max-norm constraint on the weights in combination with dropout, e.g. "||w||_2 < constant", which worked even better than dropout alone (the constant becomes another hyperparameter to tune though).
Best, Sebastian
> On Jan 9, 2017, at 1:21 PM, Jacob Schreiber wrote: > > Thomas, it can be difficult to fine tune L1/L2 regularization in the case where n_parameters >>> n_samples ~and~ n_features >> n_samples.
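Scikit-learn's MLPRegressor currently exposes neither dropout nor weight-norm constraints, so the dropout/max-norm combination mentioned above would have to be tried in another library. Purely as an illustration, and assuming a Keras 2-style API (names differ between versions), a comparable model could be sketched like this; X_train and y_train are assumed to exist:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import max_norm

model = Sequential()
# one small hidden layer, with dropout and a max-norm constraint on its weights
model.add(Dense(10, activation='relu', input_dim=60,
                kernel_constraint=max_norm(3.0)))
model.add(Dropout(0.5))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# model.fit(X_train, y_train, epochs=200, verbose=0)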
If your samples are very similar to the training data, why are simpler models not working well? > > > > On Sun, Jan 8, 2017 at 8:08 PM, Joel Nothman wrote: > Btw, I may have been unclear in the discussion of overfitting. For *training* the meta-estimator in stacking, it's standard to do something like cross_val_predict on your training set to produce its input features. > > On 8 January 2017 at 22:42, Thomas Evangelidis wrote: > Sebastian and Jacob, > > Regarding overfitting, Lasso, Ridge regression and ElasticNet have poor performance on my data. MLPregressors are way superior. On an other note, MLPregressor class has some methods to contol overfitting, like controling the alpha parameter for the L2 regularization (maybe setting it to a high value?) or the number of neurons in the hidden layers (lowering the hidden_layer_sizes?) or even "early_stopping=True". Wouldn't these be sufficient to be on the safe side. > > Once more I want to highlight something I wrote previously but might have been overlooked. The resulting MLPRegressors will be applied to new datasets that ARE VERY SIMILAR TO THE TRAINING DATA. In other words the application of the models will be strictly confined to their applicability domain. Wouldn't that be sufficient to not worry about model overfitting too much? > > > > > > On 8 January 2017 at 11:53, Sebastian Raschka wrote: >> Like to train an SVR to combine the predictions of the top 10% MLPRegressors using the same data that were used for training of the MLPRegressors? Wouldn't that lead to overfitting? > > It could, but you don't need to use the same data that you used for training to fit the meta estimator. Like it is commonly done in stacking with cross validation, you can train the mlps on training folds and pass predictions from a test fold to the meta estimator but then you'd have to retrain your mlps and it sounded like you wanted to avoid that. > > I am currently on mobile and only browsed through the thread briefly, but I agree with others that it may sound like your model(s) may have too much capacity for such a small dataset -- can be tricky to fit the parameters without overfitting. In any case, if you to do the stacking, I'd probably insert a k-fold cv between the mlps and the meta estimator. However I'd definitely also recommend simpler models als > alternative. > > Best, > Sebastian > > On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis wrote: > >> >> >> On 7 January 2017 at 21:20, Sebastian Raschka wrote: >> Hi, Thomas, >> sorry, I overread the regression part ? >> This would be a bit trickier, I am not sure what a good strategy for averaging regression outputs would be. However, if you just want to compute the average, you could do sth like >> np.mean(np.asarray([r.predict(X) for r in list_or_your_mlps])) >> >> However, it may be better to use stacking, and use the output of r.predict(X) as meta features to train a model based on these? >> >> ?Like to train an SVR to combine the predictions of the top 10% MLPRegressors using the same data that were used for training of the MLPRegressors? Wouldn't that lead to overfitting? >> ? >> >> Best, >> Sebastian >> >> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis wrote: >> > >> > Hi Sebastian, >> > >> > Thanks, I will try it in another classification problem I have. However, this time I am using regressors not classifiers. >> > >> > On Jan 7, 2017 19:28, "Sebastian Raschka" wrote: >> > Hi, Thomas, >> > >> > the VotingClassifier can combine different models per majority voting amongst their predictions. 
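Since the capacity arguments above come down to a few MLPRegressor hyperparameters (alpha for the L2 penalty, hidden_layer_sizes for the number of weights), one hedged way to pick them is a small cross-validated grid search instead of fixing them by hand. The grid and the synthetic data below are only an example:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=35, n_features=60, noise=10.0, random_state=0)

param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0],     # L2 regularization strength
              'hidden_layer_sizes': [(5,), (10,)]}  # network capacity
search = GridSearchCV(MLPRegressor(max_iter=2000, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)      # default score is R^2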
Unfortunately, it refits the classifiers though (after cloning them). I think we implemented it this way to make it compatible to GridSearch and so forth. However, I have a version of the estimator that you can initialize with ?refit=False? to avoid refitting if it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers >> > >> > Best, >> > Sebastian >> > >> > >> > >> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis wrote: >> > > >> > > Greetings, >> > > >> > > I have trained many MLPRegressors using different random_state value and estimated the R^2 using cross-validation. Now I want to combine the top 10% of them in how to get more accurate predictions. Is there a meta-estimator that can get as input a few precomputed MLPRegressors and give consensus predictions? Can the BaggingRegressor do this job using MLPRegressors as input? >> > > >> > > Thanks in advance for any hint. >> > > Thomas >> > > >> > > >> > > -- >> > > ====================================================================== >> > > Thomas Evangelidis >> > > Research Specialist >> > > CEITEC - Central European Institute of Technology >> > > Masaryk University >> > > Kamenice 5/A35/1S081, >> > > 62500 Brno, Czech Republic >> > > >> > > email: tevang at pharm.uoa.gr >> > > tevang3 at gmail.com >> > > >> > > website: https://sites.google.com/site/thomasevangelidishomepage/ >> > > >> > > >> > > _______________________________________________ >> > > scikit-learn mailing list >> > > scikit-learn at python.org >> > > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> -- >> ====================================================================== >> Thomas Evangelidis >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> tevang3 at gmail.com >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > 
https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Mon Jan 9 18:40:59 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 10 Jan 2017 00:40:59 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Jacob & Sebastian, I think the best way to find out if my modeling approach works is to find a larger dataset, split it into two parts, the first one will be used as training/cross-validation set and the second as a test set, like in a real case scenario. Regarding the MLPRegressor regularization, below is my optimum setup: MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True, > validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,)) This means only one hidden layer with maximum 10 neurons, alpha=10 for L2 regularization and early stopping to terminate training if validation score is not improving. I think this is a quite simple model. My final predictor is an SVR that combines 2 MLPRegressors, each one trained with different types of input data. @Sebastian You have mentioned dropout again but I could not find it in the docs: http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor Maybe you are referring to another MLPRegressor implementation? I have seen a while ago another implementation you had on github. Can you clarify which one you recommend and why? Thank you both of you for your hints! best Thomas -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Mon Jan 9 19:21:09 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 10 Jan 2017 00:21:09 +0000 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: If you dont have a large dataset, you can still do leave one out cross validation. On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis wrote: > > Jacob & Sebastian, > > I think the best way to find out if my modeling approach works is to find > a larger dataset, split it into two parts, the first one will be used as > training/cross-validation set and the second as a test set, like in a real > case scenario. > > Regarding the MLPRegressor regularization, below is my optimum setup: > > MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True, > validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,)) > > > This means only one hidden layer with maximum 10 neurons, alpha=10 for L2 > regularization and early stopping to terminate training if validation score > is not improving. I think this is a quite simple model. My final predictor > is an SVR that combines 2 MLPRegressors, each one trained with different > types of input data. 
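A sketch of the leave-one-out suggestion above, written with cross_val_predict so that every sample receives exactly one held-out prediction; the data set and the regressor settings are placeholders only:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=35, n_features=60, noise=10.0, random_state=0)
reg = MLPRegressor(hidden_layer_sizes=(10,), alpha=10, max_iter=2000,
                   random_state=0)

# one fit per left-out sample; all predictions are out-of-sample
y_loo = cross_val_predict(reg, X, y, cv=LeaveOneOut())
print(np.corrcoef(y, y_loo)[0, 1])   # correlation R between observed and predicted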
> > @Sebastian > You have mentioned dropout again but I could not find it in the docs: > > http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor > > Maybe you are referring to another MLPRegressor implementation? I have > seen a while ago another implementation you had on github. Can you clarify > which one you recommend and why? > > > Thank you both of you for your hints! > > best > Thomas > > > > -- > > > > > > > > > > > > > > > > > ====================================================================== > > > Thomas Evangelidis > > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > website: > > https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Jan 9 19:36:41 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 9 Jan 2017 16:36:41 -0800 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Even with a single layer with 10 neurons you're still trying to train over 6000 parameters using ~30 samples. Dropout is a concept common in neural networks, but doesn't appear to be in sklearn's implementation of MLPs. Early stopping based on validation performance isn't an "extra" step for reducing overfitting, it's basically a required step for neural networks. It seems like you have a validation sample of ~6 datapoints.. I'm still very skeptical of that giving you proper results for a complex model. Will this larger dataset be of exactly the same data? Just taking another unrelated dataset and showing that a MLP can learn it doesn't mean it will work for your specific data. Can you post the actual results from using LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds wrote: > If you dont have a large dataset, you can still do leave one out cross > validation. > > On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis > wrote: > >> >> Jacob & Sebastian, >> >> I think the best way to find out if my modeling approach works is to find >> a larger dataset, split it into two parts, the first one will be used as >> training/cross-validation set and the second as a test set, like in a real >> case scenario. >> >> Regarding the MLPRegressor regularization, below is my optimum setup: >> >> MLPRegressor(random_state=random_state, max_iter=400, >> early_stopping=True, validation_fraction=0.2, alpha=10, >> hidden_layer_sizes=(10,)) >> >> >> This means only one hidden layer with maximum 10 neurons, alpha=10 for L2 >> regularization and early stopping to terminate training if validation score >> is not improving. I think this is a quite simple model. My final predictor >> is an SVR that combines 2 MLPRegressors, each one trained with different >> types of input data. >> >> @Sebastian >> You have mentioned dropout again but I could not find it in the docs: >> http://scikit-learn.org/stable/modules/generated/sklearn.neural_network. 
>> MLPRegressor.html#sklearn.neural_network.MLPRegressor >> >> Maybe you are referring to another MLPRegressor implementation? I have >> seen a while ago another implementation you had on github. Can you clarify >> which one you recommend and why? >> >> >> Thank you both of you for your hints! >> >> best >> Thomas >> >> >> >> -- >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> ====================================================================== >> >> >> Thomas Evangelidis >> >> >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> >> tevang3 at gmail.com >> >> >> >> website: >> >> https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From smith_r at ligo.caltech.edu Mon Jan 9 20:16:37 2017 From: smith_r at ligo.caltech.edu (Rory Smith) Date: Mon, 9 Jan 2017 17:16:37 -0800 Subject: [scikit-learn] Complex variables in Gaussian mixture models? In-Reply-To: References: <1A6E40A6-5019-44F8-BF56-EC382E8908FD@ligo.caltech.edu> Message-ID: Hi Jacob, Fredrico It should be possible to treat the problem as one of having twice as many real features, but it comes at the expense of more complex code on the user's side and extra bookkeeping that would be nice to have scikit handle under the hood. I would expect that all the tricks needed to break up a Gaussian Kernel of complex variables into real and imaginary components would be relatively simple to implement within the source code. Do you think that this is worth submitting an issue to the issue tracker? (I?m not familiar with Best, Rory > On Jan 9, 2017, at 12:43 PM, Jacob Schreiber wrote: > > I'm not too familiar with how complex values are traditionally treated, but is it possible to make the complex component a real valued component and treat it just as having twice as many features? > > On Mon, Jan 9, 2017 at 11:34 AM, Rory Smith > wrote: > Hi All, > > I?d like to set up a GMM using mixture.BayesianGaussianMixture to model a probability density of complex random variables (the learned means and covariances should also be complex valued). I wasn?t able to see any mention of how to handle complex variables in the documentation so I?m curious if it?s possible in the current implementation. > I tried the obvious thing of first generating a 1D array of complex random numbers, but I see these warning when I try and fit the array X using > > dpgmm = mixture.BayesianGaussianMixture(n_components=4, > covariance_type='full', n_init=1).fit(X) > > ~/miniconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:382: ComplexWarning: Casting complex values to real discards the imaginary part > array = np.array(array, dtype=dtype, order=order, copy=copy) > > > And as might be expected from the warning, the learned means are real. > > Any advice on this problem would be greatly appreciated! 
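In the absence of native complex support, the "twice as many real features" workaround discussed in this thread takes only a few lines. This is a sketch on made-up data; the complex means are recovered afterwards from the fitted real-valued means, and the 2x2 covariance blocks of the fitted model hold the real/imaginary (co)variances.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# toy complex samples drawn around two complex centres
z = rng.randn(500) + 1j * rng.randn(500) + rng.choice([0, 3 + 3j], size=500)

# represent each complex sample as (real, imag) -> two real features
X = np.column_stack([z.real, z.imag])
dpgmm = BayesianGaussianMixture(n_components=4, covariance_type='full',
                                n_init=1).fit(X)

# complex means recovered from the real-valued ones
complex_means = dpgmm.means_[:, 0] + 1j * dpgmm.means_[:, 1]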
> > Best, > Rory > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From avn at mccme.ru Tue Jan 10 03:58:59 2017 From: avn at mccme.ru (avn at mccme.ru) Date: Tue, 10 Jan 2017 11:58:59 +0300 Subject: [scikit-learn] Generalized Discriminant Analysis with Kernel In-Reply-To: References: Message-ID: Hi Raga, You may try approximating your kernel using Nystroem kernel approximator (kernel_approximation.Nystroem) and then apply LDA to the transformed feature vectors. If you choose dimensionality of the target space (n_components) large enough (depending on your kernel and data), Nystroem approximator should provide sufficiently good kernel approximation for such combination to approximate GDA. Raga Markely ????? 2017-01-09 19:29: > Hello, > > I wonder if scikit-learn has implementation for generalized > discriminant analysis using kernel approach? > http://www.kernel-machines.org/papers/upload_21840_GDA.pdf > > I did some search, but couldn't find. > > Thank you, > Raga > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Tue Jan 10 07:46:58 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 10 Jan 2017 13:46:58 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Jacob, The features are not 6000. I train 2 MLPRegressors from two types of data, both refer to the same dataset (35 molecules in total) but each one contains different type of information. The first data consist of 60 features. I tried 100 different random states and measured the average |R| using the leave-20%-out cross-validation. Below are the results from the first data: RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658 LASSO: |R|= 0.247411754937 +- 0.232325286471 GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841 MLPRegressor: |R|= 0.540528696597 +- 0.255714448793 The second type of data consist of 456 features. Below are the results for these too: RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318 LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16 GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086 MLPRegressor: |R|= 0.455473342507 +- 0.24579081197 At the end I want to combine models created from these data types using a meta-estimator (that was my original question). The combination with the highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR that combined the best MLPRegressor from data type 1 and the best MLPRegressor from data type2: On 10 January 2017 at 01:36, Jacob Schreiber wrote: > Even with a single layer with 10 neurons you're still trying to train over > 6000 parameters using ~30 samples. Dropout is a concept common in neural > networks, but doesn't appear to be in sklearn's implementation of MLPs. > Early stopping based on validation performance isn't an "extra" step for > reducing overfitting, it's basically a required step for neural networks. 
> It seems like you have a validation sample of ~6 datapoints.. I'm still > very skeptical of that giving you proper results for a complex model. Will > this larger dataset be of exactly the same data? Just taking another > unrelated dataset and showing that a MLP can learn it doesn't mean it will > work for your specific data. Can you post the actual results from using > LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? > > On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds > wrote: > >> If you dont have a large dataset, you can still do leave one out cross >> validation. >> >> On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis >> wrote: >> >>> >>> Jacob & Sebastian, >>> >>> I think the best way to find out if my modeling approach works is to >>> find a larger dataset, split it into two parts, the first one will be used >>> as training/cross-validation set and the second as a test set, like in a >>> real case scenario. >>> >>> Regarding the MLPRegressor regularization, below is my optimum setup: >>> >>> MLPRegressor(random_state=random_state, max_iter=400, >>> early_stopping=True, validation_fraction=0.2, alpha=10, >>> hidden_layer_sizes=(10,)) >>> >>> >>> This means only one hidden layer with maximum 10 neurons, alpha=10 for >>> L2 regularization and early stopping to terminate training if validation >>> score is not improving. I think this is a quite simple model. My final >>> predictor is an SVR that combines 2 MLPRegressors, each one trained with >>> different types of input data. >>> >>> @Sebastian >>> You have mentioned dropout again but I could not find it in the docs: >>> http://scikit-learn.org/stable/modules/generated/sklearn. >>> neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor >>> >>> Maybe you are referring to another MLPRegressor implementation? I have >>> seen a while ago another implementation you had on github. Can you clarify >>> which one you recommend and why? >>> >>> >>> Thank you both of you for your hints! >>> >>> best >>> Thomas >>> >>> >>> >>> -- >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> ====================================================================== >>> >>> >>> Thomas Evangelidis >>> >>> >>> Research Specialist >>> CEITEC - Central European Institute of Technology >>> Masaryk University >>> Kamenice 5/A35/1S081, >>> 62500 Brno, Czech Republic >>> >>> email: tevang at pharm.uoa.gr >>> >>> >>> tevang3 at gmail.com >>> >>> >>> >>> website: >>> >>> https://sites.google.com/site/thomasevangelidishomepage/ >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> >>> scikit-learn mailing list >>> >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... 
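Coming back to the Nystroem suggestion for approximating kernel (generalized) discriminant analysis a few messages above, a rough sketch of that pipeline is given below. The data set, the RBF kernel and the gamma/n_components values are arbitrary choices for illustration and would need tuning on real data.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# approximate the kernel feature map, then run ordinary LDA on top of it
gda_approx = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.5, n_components=100, random_state=0),
    LinearDiscriminantAnalysis())
print(gda_approx.fit(X, y).score(X, y))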
URL: From raga.markely at gmail.com Tue Jan 10 10:16:16 2017 From: raga.markely at gmail.com (Raga Markely) Date: Tue, 10 Jan 2017 10:16:16 -0500 Subject: [scikit-learn] Generalized Discriminant Analysis with Kernel Message-ID: Thank you very much for your info on Nystroem kernel approximator. I appreciate it! Best, Raga On Tue, Jan 10, 2017 at 7:47 AM, wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > Date: Tue, 10 Jan 2017 11:58:59 +0300 > From: avn at mccme.ru > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Generalized Discriminant Analysis with > Kernel > Message-ID: > Content-Type: text/plain; charset=UTF-8; format=flowed > > Hi Raga, > > You may try approximating your kernel using Nystroem kernel approximator > (kernel_approximation.Nystroem) and then apply LDA to the transformed > feature vectors. If you choose dimensionality of the target space > (n_components) large enough (depending on your kernel and data), > Nystroem approximator should provide sufficiently good kernel > approximation for such combination to approximate GDA. > > Raga Markely ????? 2017-01-09 19:29: > > Hello, > > > > I wonder if scikit-learn has implementation for generalized > > discriminant analysis using kernel approach? > > http://www.kernel-machines.org/papers/upload_21840_GDA.pdf > > > > I did some search, but couldn't find. > > > > Thank you, > > Raga > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From surangakas at gmail.com Tue Jan 10 12:36:33 2017 From: surangakas at gmail.com (Suranga Kasthurirathne) Date: Tue, 10 Jan 2017 12:36:33 -0500 Subject: [scikit-learn] Specify boosting percentage using Randomoversampling? Message-ID: Hi all, I apologize - i've been looking for this answer all over the internet, and it could be that I'm not googling the right terms. For managing unbalanced datasets, Weka has SMOTE, and scikit has randomoversampling. In weka, we can ask it to boost by a given percentage (say 100%) so an undersampled class with 10 values ends up with 20 values (100% increase) after boosting. In Scikit learn, I cant seem to find a way to do this. The ramdomoversampler boosts arbitrarily. and seem to try to balance the two classes, which may not be realistic in some cases. Can anyone point me to how I can manage boosting percentage using scikit? -- Best Regards, Suranga -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Tue Jan 10 13:04:03 2017 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Tue, 10 Jan 2017 19:04:03 +0100 Subject: [scikit-learn] Specify boosting percentage using Randomoversampling? In-Reply-To: References: Message-ID: Is maybe this contrib what you are looking for? Take a close look to see whether it does what you expect. 
http://contrib.scikit-learn.org/imbalanced-learn/auto_examples/over-sampling/plot_smote.html On Tue, Jan 10, 2017 at 6:36 PM, Suranga Kasthurirathne < surangakas at gmail.com> wrote: > > Hi all, > > I apologize - i've been looking for this answer all over the internet, and > it could be that I'm not googling the right terms. > > For managing unbalanced datasets, Weka has SMOTE, and scikit has > randomoversampling. > > In weka, we can ask it to boost by a given percentage (say 100%) so an > undersampled class with 10 values ends up with 20 values (100% increase) > after boosting. > > In Scikit learn, I cant seem to find a way to do this. The > ramdomoversampler boosts arbitrarily. and seem to try to balance the two > classes, which may not be realistic in some cases. > > Can anyone point me to how I can manage boosting percentage using scikit? > > -- > Best Regards, > Suranga > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Jan 10 13:05:49 2017 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 10 Jan 2017 19:05:49 +0100 Subject: [scikit-learn] Specify boosting percentage using Randomoversampling? In-Reply-To: References: Message-ID: I will first assume that RandomOverSampling refer to imbalanced-learn API (a scikit-learn-contrib project). The parameter that you are seeking for is the ratio parameter. By default ratio='auto' which will balance the classes, as you described. The ratio can be given as a float as the ratio of the number of samples in the minority class over the number of samples in in the majority class. Check there for more info: http://contrib.scikit-learn.org/imbalanced-learn/generated/imblearn.over_sampling.RandomOverSampler.html#imblearn.over_sampling.RandomOverSampler On 10 January 2017 at 18:36, Suranga Kasthurirathne wrote: > > Hi all, > > I apologize - i've been looking for this answer all over the internet, and > it could be that I'm not googling the right terms. > > For managing unbalanced datasets, Weka has SMOTE, and scikit has > randomoversampling. > > In weka, we can ask it to boost by a given percentage (say 100%) so an > undersampled class with 10 values ends up with 20 values (100% increase) > after boosting. > > In Scikit learn, I cant seem to find a way to do this. The > ramdomoversampler boosts arbitrarily. and seem to try to balance the two > classes, which may not be realistic in some cases. > > Can anyone point me to how I can manage boosting percentage using scikit? > > -- > Best Regards, > Suranga > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From surangakas at gmail.com Tue Jan 10 13:24:14 2017 From: surangakas at gmail.com (Suranga Kasthurirathne) Date: Tue, 10 Jan 2017 13:24:14 -0500 Subject: [scikit-learn] Specify boosting percentage using Randomoversampling? In-Reply-To: References: Message-ID: Well actually, i'm able to answer this myself. 
Its the ratio attribute (see: http://contrib.scikit-learn.org/imbalanced-learn/generated/imblearn.over_sampling.RandomOverSampler.html ) :) :) On Tue, Jan 10, 2017 at 12:36 PM, Suranga Kasthurirathne < surangakas at gmail.com> wrote: > > Hi all, > > I apologize - i've been looking for this answer all over the internet, and > it could be that I'm not googling the right terms. > > For managing unbalanced datasets, Weka has SMOTE, and scikit has > randomoversampling. > > In weka, we can ask it to boost by a given percentage (say 100%) so an > undersampled class with 10 values ends up with 20 values (100% increase) > after boosting. > > In Scikit learn, I cant seem to find a way to do this. The > ramdomoversampler boosts arbitrarily. and seem to try to balance the two > classes, which may not be realistic in some cases. > > Can anyone point me to how I can manage boosting percentage using scikit? > > -- > Best Regards, > Suranga > -- Best Regards, Suranga -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Tue Jan 10 13:47:16 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 10 Jan 2017 10:47:16 -0800 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Thomas, Jacob's point is important -- its not the number of features that's important, its the number of free parameters. As the number of free parameters increases, the space of representable functions grows to the point where the cost function is minimized by having a single parameter explain each variable. This is true of many ML methods. In the case of a decision trees, for example you can allow each node (a free parameter) hold exactly 1 training example, and see perfect training performance. In linear methods, you can perfectly fit training data by adding additional polynomial features (for feature x_i, add x^2_i, x^3_i, x^4_i, ....) Performance on unseen data will be terrible. MLP is no different -- adding more free parameters (more flexibility to precisely model the training data) may harm more than help when it comes to unseen data performance, especially when the number of examples it small. Early stopping may help overfitting, as might dropout. The likely reasons that LASSO and GBR performed well is that they're methods that explicit manage overfitting. Perform a grid search on: - the number of hidden nodes in you MLP. - the number of iterations for both, you may find lowering values will improve performance on unseen data. On Tue, Jan 10, 2017 at 4:46 AM, Thomas Evangelidis wrote: > Jacob, > > The features are not 6000. I train 2 MLPRegressors from two types of > data, both refer to the same dataset (35 molecules in total) but each one > contains different type of information. The first data consist of 60 > features. I tried 100 different random states and measured the average |R| > using the leave-20%-out cross-validation. Below are the results from the > first data: > > RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658 > LASSO: |R|= 0.247411754937 +- 0.232325286471 > GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841 > MLPRegressor: |R|= 0.540528696597 +- 0.255714448793 > > The second type of data consist of 456 features. 
Below are the results for > these too: > > RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318 > LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16 > GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086 > MLPRegressor: |R|= 0.455473342507 +- 0.24579081197 > > > At the end I want to combine models created from these data types using a > meta-estimator (that was my original question). The combination with the > highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR > that combined the best MLPRegressor from data type 1 and the best > MLPRegressor from data type2: > > > > > > On 10 January 2017 at 01:36, Jacob Schreiber > wrote: > >> Even with a single layer with 10 neurons you're still trying to train >> over 6000 parameters using ~30 samples. Dropout is a concept common in >> neural networks, but doesn't appear to be in sklearn's implementation of >> MLPs. Early stopping based on validation performance isn't an "extra" step >> for reducing overfitting, it's basically a required step for neural >> networks. It seems like you have a validation sample of ~6 datapoints.. I'm >> still very skeptical of that giving you proper results for a complex model. >> Will this larger dataset be of exactly the same data? Just taking another >> unrelated dataset and showing that a MLP can learn it doesn't mean it will >> work for your specific data. Can you post the actual results from using >> LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? >> >> On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds < >> stuart at stuartreynolds.net> wrote: >> >>> If you dont have a large dataset, you can still do leave one out cross >>> validation. >>> >>> On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis >>> wrote: >>> >>>> >>>> Jacob & Sebastian, >>>> >>>> I think the best way to find out if my modeling approach works is to >>>> find a larger dataset, split it into two parts, the first one will be used >>>> as training/cross-validation set and the second as a test set, like in a >>>> real case scenario. >>>> >>>> Regarding the MLPRegressor regularization, below is my optimum setup: >>>> >>>> MLPRegressor(random_state=random_state, max_iter=400, >>>> early_stopping=True, validation_fraction=0.2, alpha=10, >>>> hidden_layer_sizes=(10,)) >>>> >>>> >>>> This means only one hidden layer with maximum 10 neurons, alpha=10 for >>>> L2 regularization and early stopping to terminate training if validation >>>> score is not improving. I think this is a quite simple model. My final >>>> predictor is an SVR that combines 2 MLPRegressors, each one trained with >>>> different types of input data. >>>> >>>> @Sebastian >>>> You have mentioned dropout again but I could not find it in the docs: >>>> http://scikit-learn.org/stable/modules/generated/sklearn.neu >>>> ral_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor >>>> >>>> Maybe you are referring to another MLPRegressor implementation? I have >>>> seen a while ago another implementation you had on github. Can you clarify >>>> which one you recommend and why? >>>> >>>> >>>> Thank you both of you for your hints! 
>>>> >>>> best >>>> Thomas >>>> >>>> >>>> >>>> -- >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> ====================================================================== >>>> >>>> >>>> Thomas Evangelidis >>>> >>>> >>>> Research Specialist >>>> CEITEC - Central European Institute of Technology >>>> Masaryk University >>>> Kamenice 5/A35/1S081, >>>> 62500 Brno, Czech Republic >>>> >>>> email: tevang at pharm.uoa.gr >>>> >>>> >>>> tevang3 at gmail.com >>>> >>>> >>>> >>>> website: >>>> >>>> https://sites.google.com/site/thomasevangelidishomepage/ >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> >>>> scikit-learn mailing list >>>> >>>> scikit-learn at python.org >>>> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Tue Jan 10 14:47:23 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 10 Jan 2017 20:47:23 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: Stuart, I didn't see LASSO performing well, especially with the second type of data. The alpha parameter probably needs adjustment with LassoCV. I don't know if you have read my previous messages on this thread, so I quote again my setting for MLPRegressor. MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True, validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,)) So to sum up, I must select the lowest possible value for the following parameters: * max_iter * hidden_layer_sizes (lower than 10?) * number of features in my training data. I.e. the first type of data that consisted of 60 features are preferable from that second that consisted of 456. Is this correct? On 10 January 2017 at 19:47, Stuart Reynolds wrote: > Thomas, > Jacob's point is important -- its not the number of features that's > important, its the number of free parameters. As the number of free > parameters increases, the space of representable functions grows to the > point where the cost function is minimized by having a single parameter > explain each variable. This is true of many ML methods. > > In the case of a decision trees, for example you can allow each node (a > free parameter) hold exactly 1 training example, and see perfect training > performance. 
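That decision-tree point is easy to check numerically; the data below are random stand-ins of roughly the size discussed in this thread, not anything from it.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(35, 60)
y = rng.rand(35)

tree = DecisionTreeRegressor(random_state=0)     # grows until every leaf is pure
print(tree.fit(X, y).score(X, y))                # 1.0: perfect training R^2
print(cross_val_score(tree, X, y, cv=5).mean())  # typically near zero or negative on held-out folds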
In linear methods, you can perfectly fit training data by > adding additional polynomial features (for feature x_i, add x^2_i, x^3_i, > x^4_i, ....) Performance on unseen data will be terrible. > MLP is no different -- adding more free parameters (more flexibility to > precisely model the training data) may harm more than help when it comes to > unseen data performance, especially when the number of examples it small. > > Early stopping may help overfitting, as might dropout. > > The likely reasons that LASSO and GBR performed well is that they're > methods that explicit manage overfitting. > > Perform a grid search on: > - the number of hidden nodes in you MLP. > - the number of iterations > > for both, you may find lowering values will improve performance on unseen > data. > > > > > > > > > > On Tue, Jan 10, 2017 at 4:46 AM, Thomas Evangelidis > wrote: > >> Jacob, >> >> The features are not 6000. I train 2 MLPRegressors from two types of >> data, both refer to the same dataset (35 molecules in total) but each >> one contains different type of information. The first data consist of 60 >> features. I tried 100 different random states and measured the average |R| >> using the leave-20%-out cross-validation. Below are the results from the >> first data: >> >> RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658 >> LASSO: |R|= 0.247411754937 +- 0.232325286471 >> GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841 >> MLPRegressor: |R|= 0.540528696597 +- 0.255714448793 >> >> The second type of data consist of 456 features. Below are the results >> for these too: >> >> RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318 >> LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16 >> GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086 >> MLPRegressor: |R|= 0.455473342507 +- 0.24579081197 >> >> >> At the end I want to combine models created from these data types using a >> meta-estimator (that was my original question). The combination with the >> highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR >> that combined the best MLPRegressor from data type 1 and the best >> MLPRegressor from data type2: >> >> >> >> >> >> On 10 January 2017 at 01:36, Jacob Schreiber >> wrote: >> >>> Even with a single layer with 10 neurons you're still trying to train >>> over 6000 parameters using ~30 samples. Dropout is a concept common in >>> neural networks, but doesn't appear to be in sklearn's implementation of >>> MLPs. Early stopping based on validation performance isn't an "extra" step >>> for reducing overfitting, it's basically a required step for neural >>> networks. It seems like you have a validation sample of ~6 datapoints.. I'm >>> still very skeptical of that giving you proper results for a complex model. >>> Will this larger dataset be of exactly the same data? Just taking another >>> unrelated dataset and showing that a MLP can learn it doesn't mean it will >>> work for your specific data. Can you post the actual results from using >>> LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? >>> >>> On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds < >>> stuart at stuartreynolds.net> wrote: >>> >>>> If you dont have a large dataset, you can still do leave one out cross >>>> validation. 
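The leave-one-out suggestion quoted above can be written directly with the model_selection utilities; the estimator and data here are placeholders rather than the models from this thread.

import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

X = np.random.rand(35, 60)   # placeholder data
y = np.random.rand(35)

# each sample is predicted by a model fitted on the remaining 34 samples
y_pred = cross_val_predict(SVR(), X, y, cv=LeaveOneOut())
print(abs(pearsonr(y, y_pred)[0]))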
>>>> >>>> On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis >>>> wrote: >>>> >>>>> >>>>> Jacob & Sebastian, >>>>> >>>>> I think the best way to find out if my modeling approach works is to >>>>> find a larger dataset, split it into two parts, the first one will be used >>>>> as training/cross-validation set and the second as a test set, like in a >>>>> real case scenario. >>>>> >>>>> Regarding the MLPRegressor regularization, below is my optimum setup: >>>>> >>>>> MLPRegressor(random_state=random_state, max_iter=400, >>>>> early_stopping=True, validation_fraction=0.2, alpha=10, >>>>> hidden_layer_sizes=(10,)) >>>>> >>>>> >>>>> This means only one hidden layer with maximum 10 neurons, alpha=10 for >>>>> L2 regularization and early stopping to terminate training if validation >>>>> score is not improving. I think this is a quite simple model. My final >>>>> predictor is an SVR that combines 2 MLPRegressors, each one trained with >>>>> different types of input data. >>>>> >>>>> @Sebastian >>>>> You have mentioned dropout again but I could not find it in the docs: >>>>> http://scikit-learn.org/stable/modules/generated/sklearn.neu >>>>> ral_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor >>>>> >>>>> Maybe you are referring to another MLPRegressor implementation? I have >>>>> seen a while ago another implementation you had on github. Can you clarify >>>>> which one you recommend and why? >>>>> >>>>> >>>>> Thank you both of you for your hints! >>>>> >>>>> best >>>>> Thomas >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ====================================================================== >>>>> >>>>> >>>>> Thomas Evangelidis >>>>> >>>>> >>>>> Research Specialist >>>>> CEITEC - Central European Institute of Technology >>>>> Masaryk University >>>>> Kamenice 5/A35/1S081, >>>>> 62500 Brno, Czech Republic >>>>> >>>>> email: tevang at pharm.uoa.gr >>>>> >>>>> >>>>> tevang3 at gmail.com >>>>> >>>>> >>>>> >>>>> website: >>>>> >>>>> https://sites.google.com/site/thomasevangelidishomepage/ >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> >>>>> scikit-learn mailing list >>>>> >>>>> scikit-learn at python.org >>>>> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> >> ====================================================================== >> >> Thomas Evangelidis >> >> Research Specialist >> CEITEC - Central European Institute of Technology >> Masaryk University >> Kamenice 5/A35/1S081, >> 62500 Brno, Czech Republic >> >> email: tevang at pharm.uoa.gr >> >> tevang3 at gmail.com >> >> >> website: https://sites.google.com/site/thomasevangelidishomepage/ >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- 
====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Jan 11 11:43:08 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 11 Jan 2017 11:43:08 -0500 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: <20170109151546.GM2802991@phare.normalesup.org> References: <20170109151546.GM2802991@phare.normalesup.org> Message-ID: On 01/09/2017 10:15 AM, Gael Varoquaux wrote: >> instead of setting up a roadmap I would rather just identify bugs that >> are blockers and fix only those and don't wait for any feature before >> cutting 0.19.X. > I agree with the sentiment, but this would mess with our deprecation cycle. If we release now, and then release again soonish, that means people have less calendar time to react to deprecations. We could either accept this or change all deprecations and bump the removal by a version? From t3kcit at gmail.com Wed Jan 11 11:48:01 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 11 Jan 2017 11:48:01 -0500 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: Message-ID: On 01/09/2017 09:43 AM, Olivier Grisel wrote: > In retrospect, making a small 0.19 release is probably a good idea. > > I would like to get > https://github.com/scikit-learn/scikit-learn/pull/8002 in before > cutting the 0.19.X branch. > Either way, I consider these two blocking for any kind of release: https://github.com/scikit-learn/scikit-learn/pull/7356 https://github.com/scikit-learn/scikit-learn/pull/6727 I have to write three grants in the next ~three weeks and start my first lecture. Don't count on me too much until mid-Feb. From se.raschka at gmail.com Wed Jan 11 15:16:22 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 11 Jan 2017 15:16:22 -0500 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> Message-ID: <4FB27AB1-1DFB-4D0C-A29A-405AF30B65AE@gmail.com> Hi, Thomas, I was just reading through a recent preprint (Protein-Ligand Scoring with Convolutional Neural Networks, https://arxiv.org/abs/1612.02751), and I thought that may be related to your task and maybe interesting or even useful for your work. Also check out references 13, 21, 22, and 24, where they talk about alternative (the more classic) representations of protein-ligand complexes or interactions as inputs to either random forests or multi-layer perceptrons. Best, Sebastian > On Jan 10, 2017, at 7:46 AM, Thomas Evangelidis wrote: > > Jacob, > > The features are not 6000. I train 2 MLPRegressors from two types of data, both refer to the same dataset (35 molecules in total) but each one contains different type of information. The first data consist of 60 features. I tried 100 different random states and measured the average |R| using the leave-20%-out cross-validation. 
Below are the results from the first data: > > RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658 > LASSO: |R|= 0.247411754937 +- 0.232325286471 > GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841 > MLPRegressor: |R|= 0.540528696597 +- 0.255714448793 > > The second type of data consist of 456 features. Below are the results for these too: > > RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318 > LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16 > GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086 > MLPRegressor: |R|= 0.455473342507 +- 0.24579081197 > > > At the end I want to combine models created from these data types using a meta-estimator (that was my original question). The combination with the highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR that combined the best MLPRegressor from data type 1 and the best MLPRegressor from data type2: > > > > > > On 10 January 2017 at 01:36, Jacob Schreiber wrote: > Even with a single layer with 10 neurons you're still trying to train over 6000 parameters using ~30 samples. Dropout is a concept common in neural networks, but doesn't appear to be in sklearn's implementation of MLPs. Early stopping based on validation performance isn't an "extra" step for reducing overfitting, it's basically a required step for neural networks. It seems like you have a validation sample of ~6 datapoints.. I'm still very skeptical of that giving you proper results for a complex model. Will this larger dataset be of exactly the same data? Just taking another unrelated dataset and showing that a MLP can learn it doesn't mean it will work for your specific data. Can you post the actual results from using LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? > > On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds wrote: > If you dont have a large dataset, you can still do leave one out cross validation. > > On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis wrote: > > Jacob & Sebastian, > > I think the best way to find out if my modeling approach works is to find a larger dataset, split it into two parts, the first one will be used as training/cross-validation set and the second as a test set, like in a real case scenario. > > Regarding the MLPRegressor regularization, below is my optimum setup: > > MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True, validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,)) > > This means only one hidden layer with maximum 10 neurons, alpha=10 for L2 regularization and early stopping to terminate training if validation score is not improving. I think this is a quite simple model. My final predictor is an SVR that combines 2 MLPRegressors, each one trained with different types of input data. > > @Sebastian > You have mentioned dropout again but I could not find it in the docs: > http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor > > Maybe you are referring to another MLPRegressor implementation? I have seen a while ago another implementation you had on github. Can you clarify which one you recommend and why? > > > Thank you both of you for your hints! 
> > best > Thomas > > > > -- > > > > > > > > > > > > > > > > > ====================================================================== > > > Thomas Evangelidis > > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > > tevang3 at gmail.com > > > > website: > > https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Wed Jan 11 16:41:51 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 12 Jan 2017 08:41:51 +1100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: <20170109151546.GM2802991@phare.normalesup.org> Message-ID: When the two versions deprecation policy was instituted, releases were much more frequent... Is that enough of an excuse? On 12 January 2017 at 03:43, Andreas Mueller wrote: > > > On 01/09/2017 10:15 AM, Gael Varoquaux wrote: > >> instead of setting up a roadmap I would rather just identify bugs that >>> are blockers and fix only those and don't wait for any feature before >>> cutting 0.19.X. >>> >> >> I agree with the sentiment, but this would mess with our deprecation > cycle. > If we release now, and then release again soonish, that means people have > less calendar time > to react to deprecations. > > We could either accept this or change all deprecations and bump the > removal by a version? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Jan 11 16:51:15 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 11 Jan 2017 22:51:15 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: <20170109151546.GM2802991@phare.normalesup.org> Message-ID: <20170111215115.GO1585067@phare.normalesup.org> On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman wrote: > When the two versions deprecation policy was instituted, releases were much > more frequent... Is that enough of an excuse? I'd rather say that we can here decide that we are giving a longer grace period. 
I think that slow deprecations are a good things (see titus's blog post here: http://ivory.idyll.org/blog/2017-pof-software-archivability.html ) G > On 12 January 2017 at 03:43, Andreas Mueller wrote: > On 01/09/2017 10:15 AM, Gael Varoquaux wrote: > instead of setting up a roadmap I would rather just identify bugs > that > are blockers and fix only those and don't wait for any feature > before > cutting 0.19.X. > I agree with the sentiment, but this would mess with our deprecation cycle. > If we release now, and then release again soonish, that means people have > less calendar time > to react to deprecations. > We could either accept this or change all deprecations and bump the removal > by a version? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From ismaelfm_ at ciencias.unam.mx Thu Jan 12 11:47:20 2017 From: ismaelfm_ at ciencias.unam.mx (=?UTF-8?B?Sm9zw6kgSXNtYWVsIEZlcm7DoW5kZXogTWFydMOtbmV6?=) Date: Thu, 12 Jan 2017 10:47:20 -0600 Subject: [scikit-learn] Roc curve from multilabel classification has slope In-Reply-To: <587205EC.6060402@gmail.com> References: <6EEF6426-91D8-40D1-8FB8-E2F10D0327CA@ciencias.unam.mx> <587205EC.6060402@gmail.com> Message-ID: That's indeed the case, there are ties in my predictions. In response to "plotting one ROC curve for every class in your result", it's also part of my analysis. Thank you very much. Ismael 2017-01-08 3:27 GMT-06:00 Roman Yurchak : > Jos?, I might be misunderstanding something, but wouldn't it make more > sens to plot one ROC curve for every class in your result (using all > samples at once), as opposed to plotting it for every training sample as > you are doing now? Cf the example below, > > http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html > > Roman > > On 08/01/17 01:42, Jacob Schreiber wrote: > > Slope usually means there are ties in your predictions. Check your > > dataset to see if you have repeated predicted values (possibly 1 or 0). > > > > On Sat, Jan 7, 2017 at 4:32 PM, Jos? Ismael Fern?ndez Mart?nez > > > wrote: > > > > But is not a scikit-learn classifier, is a keras classifier which, > > in the functional API, predict returns probabilities. > > What I don't understand is why my plot of the roc curve has a slope, > > since I call roc_curve passing the actual label as y_true and the > > output of the classifier (score probabilities) as y_score for every > > element tested. > > > > > > > > Sent from my iPhone > > On Jan 7, 2017, at 4:04 PM, Joel Nothman > > wrote: > > > >> predict method should not return probabilities in scikit-learn > >> classifiers. predict_proba should. > >> > >> On 8 January 2017 at 07:52, Jos? Ismael Fern?ndez Mart?nez > >> > > >> wrote: > >> > >> Hi, I have a multilabel classifier written in Keras from which > >> I want to compute AUC and plot a ROC curve for every element > >> classified from my test set. 
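Regarding the tied-predictions explanation given earlier in this thread, the effect is easy to reproduce: roc_curve keeps at most one point per distinct score value, so a block of tied scores containing both positives and negatives is drawn as a single sloped segment. The numbers below are made up purely for illustration.

import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.5, 0.5, 0.5, 0.5, 0.9])  # several predictions tied at 0.5

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# the tied block of mixed positives and negatives collapses to one ROC point,
# so plotting fpr vs. tpr joins across it with a diagonal (sloped) segment
print(thresholds)
print(fpr, tpr)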
> >> > >> > >> > >> Everything seems fine, except that some elements have a roc > >> curve that have a slope as follows: > >> > >> enter image description here > >> I don't know how to > >> interpret the slope in such cases. > >> > >> Basically my workflow goes as follows, I have a > >> pre-trained |model|, instance of Keras, and I have the > >> features |X| and the binarized labels |y|, every element > >> in |y| is an array of length 1000, as it is a multilabel > >> classification problem each element in |y| might contain many > >> 1s, indicating that the element belongs to multiples classes, > >> so I used the built-in loss of |binary_crossentropy| and my > >> outputs of the model prediction are score probailities. Then I > >> plot the roc curve as follows. > >> > >> > >> The predict method returns probabilities, as I'm using the > >> functional api of keras. > >> > >> Does anyone knows why my roc curves looks like this? > >> > >> > >> Ismael > >> > >> > >> > >> Sent from my iPhone > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Mon Jan 16 09:57:05 2017 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Mon, 16 Jan 2017 15:57:05 +0100 Subject: [scikit-learn] meta-estimator for multiple MLPRegressor In-Reply-To: <4FB27AB1-1DFB-4D0C-A29A-405AF30B65AE@gmail.com> References: <27CD690B-CA77-4121-8C95-9F2E52B99B95@gmail.com> <450C2C8D-86FC-4A87-B307-C5E45FE97C4B@gmail.com> <4FB27AB1-1DFB-4D0C-A29A-405AF30B65AE@gmail.com> Message-ID: Hi Thomas, An example os such "dummy" meta-regressor can be seen in NNScore, which is protein-ligand scoring function (one of Sebastian's suggestions). A meta-class is implemented in Open Drug Discovery Toolkit [here: https://github.com/oddt/oddt/blob/master/oddt/scoring/__init__.py#L200], along with also suggested RF-Score and few other methods you might find useful. Actually, what NNScore does it train 1000 MLPRegressors and pick 20 best scored on PDBbind test set. An ensemble prediction is mean prediction of those best models. ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2017-01-11 21:16 GMT+01:00 Sebastian Raschka : > Hi, Thomas, > > I was just reading through a recent preprint (Protein-Ligand Scoring with > Convolutional Neural Networks, https://arxiv.org/abs/1612.02751), and I > thought that may be related to your task and maybe interesting or even > useful for your work. 
> Also check out references 13, 21, 22, and 24, where they talk about > alternative (the more classic) representations of protein-ligand complexes > or interactions as inputs to either random forests or multi-layer > perceptrons. > > Best, > Sebastian > > > > On Jan 10, 2017, at 7:46 AM, Thomas Evangelidis > wrote: > > > > Jacob, > > > > The features are not 6000. I train 2 MLPRegressors from two types of > data, both refer to the same dataset (35 molecules in total) but each one > contains different type of information. The first data consist of 60 > features. I tried 100 different random states and measured the average |R| > using the leave-20%-out cross-validation. Below are the results from the > first data: > > > > RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658 > > LASSO: |R|= 0.247411754937 +- 0.232325286471 > > GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841 > > MLPRegressor: |R|= 0.540528696597 +- 0.255714448793 > > > > The second type of data consist of 456 features. Below are the results > for these too: > > > > RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318 > > LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16 > > GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086 > > MLPRegressor: |R|= 0.455473342507 +- 0.24579081197 > > > > > > At the end I want to combine models created from these data types using > a meta-estimator (that was my original question). The combination with the > highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR that > combined the best MLPRegressor from data type 1 and the best MLPRegressor > from data type2: > > > > > > > > > > > > On 10 January 2017 at 01:36, Jacob Schreiber > wrote: > > Even with a single layer with 10 neurons you're still trying to train > over 6000 parameters using ~30 samples. Dropout is a concept common in > neural networks, but doesn't appear to be in sklearn's implementation of > MLPs. Early stopping based on validation performance isn't an "extra" step > for reducing overfitting, it's basically a required step for neural > networks. It seems like you have a validation sample of ~6 datapoints.. I'm > still very skeptical of that giving you proper results for a complex model. > Will this larger dataset be of exactly the same data? Just taking another > unrelated dataset and showing that a MLP can learn it doesn't mean it will > work for your specific data. Can you post the actual results from using > LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? > > > > On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds < > stuart at stuartreynolds.net> wrote: > > If you dont have a large dataset, you can still do leave one out cross > validation. > > > > On Mon, Jan 9, 2017 at 3:42 PM Thomas Evangelidis > wrote: > > > > Jacob & Sebastian, > > > > I think the best way to find out if my modeling approach works is to > find a larger dataset, split it into two parts, the first one will be used > as training/cross-validation set and the second as a test set, like in a > real case scenario. > > > > Regarding the MLPRegressor regularization, below is my optimum setup: > > > > MLPRegressor(random_state=random_state, max_iter=400, > early_stopping=True, validation_fraction=0.2, alpha=10, > hidden_layer_sizes=(10,)) > > > > This means only one hidden layer with maximum 10 neurons, alpha=10 for > L2 regularization and early stopping to terminate training if validation > score is not improving. I think this is a quite simple model. 
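The combination described in this thread, an SVR fitted on the outputs of two MLPRegressors trained on different feature sets, might be sketched roughly as below. X1, X2 and y are hypothetical arrays, and since the thread does not spell out how the base-model predictions are generated, out-of-fold predictions via cross_val_predict are used here as one way to avoid training the combiner on leaked fits.

import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

X1 = np.random.rand(35, 60)    # hypothetical first feature set
X2 = np.random.rand(35, 456)   # hypothetical second feature set
y = np.random.rand(35)

def base_model(seed):
    return MLPRegressor(hidden_layer_sizes=(10,), alpha=10,
                        early_stopping=True, validation_fraction=0.2,
                        max_iter=400, random_state=seed)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# out-of-fold predictions from each base model become the two meta-features
meta_features = np.column_stack([
    cross_val_predict(base_model(0), X1, y, cv=cv),
    cross_val_predict(base_model(1), X2, y, cv=cv),
])

combiner = SVR().fit(meta_features, y)
print(combiner.predict(meta_features[:5]))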
My final > predictor is an SVR that combines 2 MLPRegressors, each one trained with > different types of input data. > > > > @Sebastian > > You have mentioned dropout again but I could not find it in the docs: > > http://scikit-learn.org/stable/modules/generated/sklearn.neural_network. > MLPRegressor.html#sklearn.neural_network.MLPRegressor > > > > Maybe you are referring to another MLPRegressor implementation? I have > seen a while ago another implementation you had on github. Can you clarify > which one you recommend and why? > > > > > > Thank you both of you for your hints! > > > > best > > Thomas > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ====================================================================== > > > > > > Thomas Evangelidis > > > > > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > > > > > tevang3 at gmail.com > > > > > > > > website: > > > > https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > scikit-learn mailing list > > > > scikit-learn at python.org > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From max.linke88 at gmail.com Mon Jan 16 10:23:05 2017 From: max.linke88 at gmail.com (Max Linke) Date: Mon, 16 Jan 2017 16:23:05 +0100 Subject: [scikit-learn] GSOC 2017: NumFOCUS will be an umbrella organization Message-ID: <40f85dfb-0635-a042-79b7-039d3dd347a9@gmail.com> Hi Organizations can start submitting applications for Google Summer of Code 2017 on January 19 (and the deadline is February 9) https://developers.google.com/open-source/gsoc/timeline?hl=en NumFOCUS will be applying again this year. If you want to work with us please let me know and if you apply as an organization yourself or under a different umbrella organization please tell me as well. If you participate with us it would be great if you start to add possible projects to the ideas page on github soon. We some general information for mentors on github. https://github.com/numfocus/gsoc/blob/master/CONTRIBUTING-mentors.md We also have a template for ideas that might help. It lists the things Google likes to see. 
https://github.com/numfocus/gsoc/blob/master/2017/ideas-list-skeleton.md In case you participated in earlier years with NumFOCUS there are some small changes this year. Raniere won't be the admin this year. Instead I'm going to be the admin. We are also planning to include two explicit rules when a student should be failed, they have to communicate regularly and commit code into your development branch at the end of the summer. best, Max From aadityajamuar at gmail.com Fri Jan 20 09:19:02 2017 From: aadityajamuar at gmail.com (Aaditya Jamuar) Date: Fri, 20 Jan 2017 19:49:02 +0530 Subject: [scikit-learn] Pipeline conventions for wrappers Message-ID: Hi Guys, I am currently working on gensim ( https://github.com/RaRe-Technologies/gensim) , writing wrappers for Scikit-learn for easy integration of LDA ( https://github.com/RaRe-Technologies/gensim/pull/932/files). While I have covered most of the API conventions as specified on scikit-learn's website, I am stuck at how to implement pipelines. I am particularly looking for what are some of the conventions very specific to the pipeline architecture. Please suggest Thank you Aaditya Jamuar -------------- next part -------------- An HTML attachment was scrubbed... URL: From malcorn at redhat.com Fri Jan 20 11:37:02 2017 From: malcorn at redhat.com (Michael Alcorn) Date: Fri, 20 Jan 2017 10:37:02 -0600 Subject: [scikit-learn] PR #8190: "Implement Complement Naive Bayes." Message-ID: Hi all, I would appreciate it if a couple of maintainers could take a look at my pull request (https://github.com/scikit-learn/scikit-learn/pull/8190) implementing the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB regularly outperforms the standard Multinomial Naive Bayes (MNB) classifier on real world data sets due to the tendency for real world data sets to suffer from class imbalance. Apache Mahout offers its own implementation of CNB alongside MNB, but it would be nice to have an easily usable CNB implementation available in scikit-learn. Training the CNB classifier on a reasonably sized data set of 493,038 documents with a median length of 87 tokens and 1,155,784 distinct tokens took around 8.5 seconds. For comparison, the MNB classifier took around 4.5 seconds to train, but the CNB had a 10% lower error rate, a seemingly worthwhile tradeoff. Happy to answer any questions or discuss further. Thanks, Michael A. Alcorn -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian.illner at imtek.uni-freiburg.de Fri Jan 20 10:52:08 2017 From: sebastian.illner at imtek.uni-freiburg.de (Sebastian Illner) Date: Fri, 20 Jan 2017 16:52:08 +0100 Subject: [scikit-learn] Identify spectra with "marker" Message-ID: <784ddc76-6e28-77a6-0ae1-7de9212d3764@imtek.uni-freiburg.de> Hi guys, I'm new to NIR-measurement as wenn as chemometrics. My current project involvs the recognition of determined spectra (of a reference system) among others. The materials are currentlys not really set. So I try to give a predetermined mixture of substances into another matrix and group the measured NIR-spectra according to a) contains predetermined mixture and b) does not contain mixture (but other mixtures could be possible). This way the mixture could be used as a unique marker. What would be the best chemometric way to accomplish this task? Currently I am trying to use PLS-DA, SMC and PCA (combined with a distance quantifier). Thanks for your thought about this. 
seb -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jeremiah.Johnson at unh.edu Fri Jan 20 14:16:54 2017 From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah) Date: Fri, 20 Jan 2017 19:16:54 +0000 Subject: [scikit-learn] top N accuracy classification metric Message-ID: Hi all, It's common to use a top-n accuracy metric for multi-class classification problems, where for each observation the prediction is the set of probabilities for each of the classes, and a prediction is top-N accurate if the correct class is among the N highest predicted probability classes. I've written a simple implementation, but I don't think it quite fits the sklearn api. Specifically, _check_targets objects to the the continuous-multioutput format of the predictions for a classification task. Is there any interest in including a metric like this? I'd be happy to submit a pull request. Jeremiah -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jan 21 05:49:39 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 21 Jan 2017 21:49:39 +1100 Subject: [scikit-learn] Pipeline conventions for wrappers In-Reply-To: References: Message-ID: I think you'll need to be more specific. What do you want a pipeline to do for you? On 21 January 2017 at 01:19, Aaditya Jamuar wrote: > Hi Guys, > > I am currently working on gensim (https://github.com/RaRe- > Technologies/gensim) , writing wrappers for Scikit-learn for easy > integration of LDA (https://github.com/RaRe-Technologies/gensim/pull/932/ > files). > > While I have covered most of the API conventions as specified on > scikit-learn's website, I am stuck at how to implement pipelines. > > I am particularly looking for what are some of the conventions very > specific to the pipeline architecture. > > Please suggest > > Thank you > > Aaditya Jamuar > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jan 21 05:50:33 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 21 Jan 2017 21:50:33 +1100 Subject: [scikit-learn] Identify spectra with "marker" In-Reply-To: <784ddc76-6e28-77a6-0ae1-7de9212d3764@imtek.uni-freiburg.de> References: <784ddc76-6e28-77a6-0ae1-7de9212d3764@imtek.uni-freiburg.de> Message-ID: Wrong mailing list? On 21 January 2017 at 02:52, Sebastian Illner < sebastian.illner at imtek.uni-freiburg.de> wrote: > Hi guys, > I'm new to NIR-measurement as wenn as chemometrics. My current project > involvs the recognition of determined spectra (of a reference system) among > others. > The materials are currentlys not really set. So I try to give a > predetermined mixture of substances into another matrix and group the > measured NIR-spectra according to a) contains predetermined mixture and b) > does not contain mixture (but other mixtures could be possible). This way > the mixture could be used as a unique marker. > What would be the best chemometric way to accomplish this task? > > Currently I am trying to use PLS-DA, SMC and PCA (combined with a distance > quantifier). > Thanks for your thought about this. 
> seb > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jan 21 05:52:10 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 21 Jan 2017 21:52:10 +1100 Subject: [scikit-learn] top N accuracy classification metric In-Reply-To: References: Message-ID: There are metrics with that kind of input in sklearn.metrics.ranking. I don't have the time to look them up now, but there have been proposals and PRs for similar ranking metrics. Please search the issue tracker for related issues. Thanks, Joel On 21 January 2017 at 06:16, Johnson, Jeremiah wrote: > Hi all, > > It?s common to use a top-n accuracy metric for multi-class classification > problems, where for each observation the prediction is the set of > probabilities for each of the classes, and a prediction is top-N accurate > if the correct class is among the N highest predicted probability classes. > I?ve written a simple implementation, but I don?t think it quite fits the > sklearn api. Specifically, _check_targets objects to the the > continuous-multioutput format of the predictions for a classification task. > Is there any interest in including a metric like this? I?d be happy to > submit a pull request. > > Jeremiah > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sat Jan 21 05:54:48 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 21 Jan 2017 11:54:48 +0100 Subject: [scikit-learn] numpy integration with random forrest implementation Message-ID: <9DA76233-6CAD-4ABC-8A30-241B2F8A61CA@gmail.com> Hi guys.. I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. I therefore wanted to give random forrest a try, and see whether it could provide me a better result. I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes depending on length of the audio file. Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? Or do i have do it in a different way? and if so how? kind regards Carl truz From noflaco at gmail.com Sat Jan 21 06:18:15 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 21 Jan 2017 12:18:15 +0100 Subject: [scikit-learn] numpy integration with random forrest implementation Message-ID: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Hi guys.. I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. 
I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. I therefore wanted to give random forrest a try, and see whether it could provide me a better result. I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes depending on length of the audio file. Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? Or do i have do it in a different way? and if so how? kind regards Carl truz From jmschreiber91 at gmail.com Sat Jan 21 12:25:22 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 21 Jan 2017 09:25:22 -0800 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: > Hi guys.. > > I am currently working on a ASR project in which the objective is to > substitute part of the general ASR framework with some form of neural > network, to see whether the tested part improves in any way. > > I started working with the feature extraction and tried, to make a neural > network (NN) that could create MFCC features. I already know what the > desired output is supposed to be, so the problem boils down to a simple > input - output mapping. Problem here is the my NN doesn?t seem to perform > that well.. and i seem to get pretty large error for some reason. > > I therefore wanted to give random forrest a try, and see whether it could > provide me a better result. > > I am currently storing my input and output in numpy.ndarrays, in which the > input and output columns is consistent throughout all the examples, but the > number of rows changes > depending on length of the audio file. > > Is it possible with the random forrest implementation in scikit-learn to > train a random forrest to map an input an output, given they are stored > numpy.ndarrays? > Or do i have do it in a different way? and if so how? > > kind regards > > Carl truz > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sat Jan 21 12:35:00 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 21 Jan 2017 18:35:00 +0100 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: Thanks for the response! If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. > Den 21. jan. 2017 kl. 
18.25 skrev Jacob Schreiber : > > If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? > > On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks > wrote: > Hi guys.. > > I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. > > I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple > input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. > > I therefore wanted to give random forrest a try, and see whether it could provide me a better result. > > I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes > depending on length of the audio file. > > Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? > Or do i have do it in a different way? and if so how? > > kind regards > > Carl truz > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jan 21 12:42:52 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 21 Jan 2017 09:42:52 -0800 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: I don't understand what you mean. Does each sample have a fixed number of features or not? On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks wrote: > Thanks for the response! > > If you see it in 1d then yes?. it has variable length. In 2d will the > number of columns always be constant both for the input and output. > > Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber >: > > If what you're saying is that you have a variable length input, then most > sklearn classifiers won't work on this data. They expect a fixed feature > set. Perhaps you could try extracting a set of informative features being > fed into the classifier? > > On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: > >> Hi guys.. >> >> I am currently working on a ASR project in which the objective is to >> substitute part of the general ASR framework with some form of neural >> network, to see whether the tested part improves in any way. >> >> I started working with the feature extraction and tried, to make a neural >> network (NN) that could create MFCC features. I already know what the >> desired output is supposed to be, so the problem boils down to a simple >> input - output mapping. Problem here is the my NN doesn?t seem to >> perform that well.. and i seem to get pretty large error for some reason. 
>> >> I therefore wanted to give random forrest a try, and see whether it could >> provide me a better result. >> >> I am currently storing my input and output in numpy.ndarrays, in which >> the input and output columns is consistent throughout all the examples, but >> the number of rows changes >> depending on length of the audio file. >> >> Is it possible with the random forrest implementation in scikit-learn to >> train a random forrest to map an input an output, given they are stored >> numpy.ndarrays? >> Or do i have do it in a different way? and if so how? >> >> kind regards >> >> Carl truz >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Sat Jan 21 12:59:22 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 21 Jan 2017 18:59:22 +0100 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: Most of the machine learning library i?ve tried has an option of of just give the dimension? In this case my input consist of an numpy.ndarray with shape (x,2050) and the output is an numpy.ndarray with shape (x,13) x is different for each set? But for each set is the number of columns consistent. Column consistency is usually enough for most library tools i?ve worked with? But is this not the case here? > Den 21. jan. 2017 kl. 18.42 skrev Jacob Schreiber : > > I don't understand what you mean. Does each sample have a fixed number of features or not? > > On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks > wrote: > Thanks for the response! > > If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. > >> Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber >: >> >> If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? >> >> On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks > wrote: >> Hi guys.. >> >> I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. >> >> I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple >> input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. >> >> I therefore wanted to give random forrest a try, and see whether it could provide me a better result. 
>> >> I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes >> depending on length of the audio file. >> >> Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? >> Or do i have do it in a different way? and if so how? >> >> kind regards >> >> Carl truz >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sat Jan 21 13:24:17 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sat, 21 Jan 2017 13:24:17 -0500 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: Hi, Carlton, sounds like you are looking for multilabel classification and your target array has the shape [n_samples, n_outputs]? If the output shape is consistent (aka all output label arrays have 13 columns), you should be fine, otherwise, you could use the MultiLabelBinarizer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer). Also, the RandomForestClassifier should support multillabel classification. Best, Sebastian > On Jan 21, 2017, at 12:59 PM, Carlton Banks wrote: > > Most of the machine learning library i?ve tried has an option of of just give the dimension? > In this case my input consist of an numpy.ndarray with shape (x,2050) and the output is an numpy.ndarray with shape (x,13) > x is different for each set? > But for each set is the number of columns consistent. > > Column consistency is usually enough for most library tools i?ve worked with? > But is this not the case here? >> Den 21. jan. 2017 kl. 18.42 skrev Jacob Schreiber : >> >> I don't understand what you mean. Does each sample have a fixed number of features or not? >> >> On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks wrote: >> Thanks for the response! >> >> If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. >> >>> Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber : >>> >>> If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? >>> >>> On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: >>> Hi guys.. >>> >>> I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. 
>>> >>> I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple >>> input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. >>> >>> I therefore wanted to give random forrest a try, and see whether it could provide me a better result. >>> >>> I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes >>> depending on length of the audio file. >>> >>> Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? >>> Or do i have do it in a different way? and if so how? >>> >>> kind regards >>> >>> Carl truz >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mailfordebu at gmail.com Sat Jan 21 13:18:45 2017 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Sat, 21 Jan 2017 23:48:45 +0530 Subject: [scikit-learn] Query regarding parameter class_weight in Random Forest Classifier Message-ID: Hi All, Greetings ! I have a very basic question regarding the usage of the parameter class_weight in scikit learn's Random Forest Classifier's fit method. I have a fairly unbalanced sample and my positive class : negative class ratio is 1:100. In other words, I have a million records corresponding to negative class and 10,000 records corresponding to positive class. I have trained the random forest classifier model using the above record set successfully. Further, for a different problem, I want to test the parameter class_weight. So, I am setting the class_weight as [0:0.001 , 1:0.999] and I have tried running my model on the same dataset as mentioned in the above paragraph but with the positive class records reduced to 1000 [because now each positive class is given approximately 10 times more weight than a negative class]. However, the model run results are very very different between the 2 runs (with and without class_weight). And I expected a similar run results. Would you please be able to let me know where am I getting wrong. I know it's something silly but just want to improve on my concept. Thanks ! -------------- next part -------------- An HTML attachment was scrubbed... 
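For the class_weight question just above, a rough, self-contained sketch of the setup being described might look like the following. The data is synthetic and far smaller than the 1,000,000 / 10,000 split in the question, and the weights, names, and numbers are illustrative rather than taken from the original model code.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n_neg, n_pos = 20000, 200                        # roughly 100:1 imbalance
X = rng.rand(n_neg + n_pos, 10)
y = np.r_[np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)]

# class_weight scales the weight of every sample of that class, so
# {0: 1, 1: 100} makes each positive sample count 100 times as much.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight={0: 1, 1: 100},
                             random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]              # positive-class scores

class_weight='balanced' is another option; it derives the per-class weights from the class frequencies in y automatically.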
URL: From noflaco at gmail.com Sat Jan 21 13:27:37 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 21 Jan 2017 19:27:37 +0100 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: Not classifiication? but regression.. and yes both the input and output should be stored stored like that.. > Den 21. jan. 2017 kl. 19.24 skrev Sebastian Raschka : > > Hi, Carlton, > sounds like you are looking for multilabel classification and your target array has the shape [n_samples, n_outputs]? If the output shape is consistent (aka all output label arrays have 13 columns), you should be fine, otherwise, you could use the MultiLabelBinarizer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer). > > Also, the RandomForestClassifier should support multillabel classification. > > Best, > Sebastian > >> On Jan 21, 2017, at 12:59 PM, Carlton Banks wrote: >> >> Most of the machine learning library i?ve tried has an option of of just give the dimension? >> In this case my input consist of an numpy.ndarray with shape (x,2050) and the output is an numpy.ndarray with shape (x,13) >> x is different for each set? >> But for each set is the number of columns consistent. >> >> Column consistency is usually enough for most library tools i?ve worked with? >> But is this not the case here? >>> Den 21. jan. 2017 kl. 18.42 skrev Jacob Schreiber : >>> >>> I don't understand what you mean. Does each sample have a fixed number of features or not? >>> >>> On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks wrote: >>> Thanks for the response! >>> >>> If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. >>> >>>> Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber : >>>> >>>> If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? >>>> >>>> On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: >>>> Hi guys.. >>>> >>>> I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. >>>> >>>> I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple >>>> input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. >>>> >>>> I therefore wanted to give random forrest a try, and see whether it could provide me a better result. >>>> >>>> I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes >>>> depending on length of the audio file. >>>> >>>> Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? >>>> Or do i have do it in a different way? and if so how? 
>>>> >>>> kind regards >>>> >>>> Carl truz >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Sat Jan 21 13:32:58 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sat, 21 Jan 2017 13:32:58 -0500 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> Message-ID: <287B065E-1841-4F12-9CBE-4D06A6C8525F@gmail.com> Oh okay. But that shouldn?t be a problem, the RandomForestRegressor also supports multi-outpout regression; same expected target array shape: [n_samples, n_outputs] Best, Sebastian > On Jan 21, 2017, at 1:27 PM, Carlton Banks wrote: > > Not classifiication? but regression.. > and yes both the input and output should be stored stored like that.. > >> Den 21. jan. 2017 kl. 19.24 skrev Sebastian Raschka : >> >> Hi, Carlton, >> sounds like you are looking for multilabel classification and your target array has the shape [n_samples, n_outputs]? If the output shape is consistent (aka all output label arrays have 13 columns), you should be fine, otherwise, you could use the MultiLabelBinarizer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer). >> >> Also, the RandomForestClassifier should support multillabel classification. >> >> Best, >> Sebastian >> >>> On Jan 21, 2017, at 12:59 PM, Carlton Banks wrote: >>> >>> Most of the machine learning library i?ve tried has an option of of just give the dimension? >>> In this case my input consist of an numpy.ndarray with shape (x,2050) and the output is an numpy.ndarray with shape (x,13) >>> x is different for each set? >>> But for each set is the number of columns consistent. >>> >>> Column consistency is usually enough for most library tools i?ve worked with? >>> But is this not the case here? >>>> Den 21. jan. 2017 kl. 18.42 skrev Jacob Schreiber : >>>> >>>> I don't understand what you mean. Does each sample have a fixed number of features or not? >>>> >>>> On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks wrote: >>>> Thanks for the response! >>>> >>>> If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. >>>> >>>>> Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber : >>>>> >>>>> If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. 
Perhaps you could try extracting a set of informative features being fed into the classifier? >>>>> >>>>> On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: >>>>> Hi guys.. >>>>> >>>>> I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. >>>>> >>>>> I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple >>>>> input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. >>>>> >>>>> I therefore wanted to give random forrest a try, and see whether it could provide me a better result. >>>>> >>>>> I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes >>>>> depending on length of the audio file. >>>>> >>>>> Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? >>>>> Or do i have do it in a different way? and if so how? >>>>> >>>>> kind regards >>>>> >>>>> Carl truz >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From noflaco at gmail.com Sat Jan 21 13:36:51 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 21 Jan 2017 19:36:51 +0100 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: <287B065E-1841-4F12-9CBE-4D06A6C8525F@gmail.com> References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> <287B065E-1841-4F12-9CBE-4D06A6C8525F@gmail.com> Message-ID: <9A97C9C1-8D1E-4C46-886D-A14F840ADE58@gmail.com> Thanks for the Info!.. How do you set it up.. There doesn?t seem a example available for regression purposes.. > Den 21. jan. 2017 kl. 19.32 skrev Sebastian Raschka : > > Oh okay. But that shouldn?t be a problem, the RandomForestRegressor also supports multi-outpout regression; same expected target array shape: [n_samples, n_outputs] > > Best, > Sebastian > >> On Jan 21, 2017, at 1:27 PM, Carlton Banks wrote: >> >> Not classifiication? but regression.. 
>> and yes both the input and output should be stored stored like that.. >> >>> Den 21. jan. 2017 kl. 19.24 skrev Sebastian Raschka : >>> >>> Hi, Carlton, >>> sounds like you are looking for multilabel classification and your target array has the shape [n_samples, n_outputs]? If the output shape is consistent (aka all output label arrays have 13 columns), you should be fine, otherwise, you could use the MultiLabelBinarizer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer). >>> >>> Also, the RandomForestClassifier should support multillabel classification. >>> >>> Best, >>> Sebastian >>> >>>> On Jan 21, 2017, at 12:59 PM, Carlton Banks wrote: >>>> >>>> Most of the machine learning library i?ve tried has an option of of just give the dimension? >>>> In this case my input consist of an numpy.ndarray with shape (x,2050) and the output is an numpy.ndarray with shape (x,13) >>>> x is different for each set? >>>> But for each set is the number of columns consistent. >>>> >>>> Column consistency is usually enough for most library tools i?ve worked with? >>>> But is this not the case here? >>>>> Den 21. jan. 2017 kl. 18.42 skrev Jacob Schreiber : >>>>> >>>>> I don't understand what you mean. Does each sample have a fixed number of features or not? >>>>> >>>>> On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks wrote: >>>>> Thanks for the response! >>>>> >>>>> If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. >>>>> >>>>>> Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber : >>>>>> >>>>>> If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? >>>>>> >>>>>> On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: >>>>>> Hi guys.. >>>>>> >>>>>> I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. >>>>>> >>>>>> I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple >>>>>> input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. >>>>>> >>>>>> I therefore wanted to give random forrest a try, and see whether it could provide me a better result. >>>>>> >>>>>> I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes >>>>>> depending on length of the audio file. >>>>>> >>>>>> Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? >>>>>> Or do i have do it in a different way? and if so how? 
>>>>>> >>>>>> kind regards >>>>>> >>>>>> Carl truz >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Sat Jan 21 13:55:40 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sat, 21 Jan 2017 13:55:40 -0500 Subject: [scikit-learn] numpy integration with random forrest implementation In-Reply-To: <9A97C9C1-8D1E-4C46-886D-A14F840ADE58@gmail.com> References: <8DEC1F0F-D487-4C98-AE36-A7D23B78D6BB@gmail.com> <287B065E-1841-4F12-9CBE-4D06A6C8525F@gmail.com> <9A97C9C1-8D1E-4C46-886D-A14F840ADE58@gmail.com> Message-ID: It should be simply tf = RandomForestRegressor() rf.fit(X_train, y_train) rf.predict(X_validation) ... Maybe also check out this documentation example here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_regression_multioutput.html > On Jan 21, 2017, at 1:36 PM, Carlton Banks wrote: > > Thanks for the Info!.. > How do you set it up.. > > There doesn?t seem a example available for regression purposes.. >> Den 21. jan. 2017 kl. 19.32 skrev Sebastian Raschka : >> >> Oh okay. But that shouldn?t be a problem, the RandomForestRegressor also supports multi-outpout regression; same expected target array shape: [n_samples, n_outputs] >> >> Best, >> Sebastian >> >>> On Jan 21, 2017, at 1:27 PM, Carlton Banks wrote: >>> >>> Not classifiication? but regression.. >>> and yes both the input and output should be stored stored like that.. >>> >>>> Den 21. jan. 2017 kl. 19.24 skrev Sebastian Raschka : >>>> >>>> Hi, Carlton, >>>> sounds like you are looking for multilabel classification and your target array has the shape [n_samples, n_outputs]? If the output shape is consistent (aka all output label arrays have 13 columns), you should be fine, otherwise, you could use the MultiLabelBinarizer (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer). >>>> >>>> Also, the RandomForestClassifier should support multillabel classification. 
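To make the RandomForestRegressor suggestion above concrete, here is a self-contained sketch using the shapes from this thread: rows are frames stacked across audio files, each with 2050 input features and 13 output values. All of the data is random stand-in data, not the poster's actual MFCC pipeline; note also that the short snippet above assigns the estimator to tf but then calls rf, so the two names need to match.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(200, 2050)      # stacked frames from the training files
Y_train = rng.rand(200, 13)        # 13 target values per frame
X_test = rng.rand(50, 2050)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, Y_train)           # multi-output: Y_train has shape (n_samples, 13)
Y_pred = rf.predict(X_test)        # -> shape (50, 13)

Because every row is a single frame, audio files of different lengths simply contribute different numbers of rows, which is what keeps the per-sample feature count fixed.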
>>>> >>>> Best, >>>> Sebastian >>>> >>>>> On Jan 21, 2017, at 12:59 PM, Carlton Banks wrote: >>>>> >>>>> Most of the machine learning library i?ve tried has an option of of just give the dimension? >>>>> In this case my input consist of an numpy.ndarray with shape (x,2050) and the output is an numpy.ndarray with shape (x,13) >>>>> x is different for each set? >>>>> But for each set is the number of columns consistent. >>>>> >>>>> Column consistency is usually enough for most library tools i?ve worked with? >>>>> But is this not the case here? >>>>>> Den 21. jan. 2017 kl. 18.42 skrev Jacob Schreiber : >>>>>> >>>>>> I don't understand what you mean. Does each sample have a fixed number of features or not? >>>>>> >>>>>> On Sat, Jan 21, 2017 at 9:35 AM, Carlton Banks wrote: >>>>>> Thanks for the response! >>>>>> >>>>>> If you see it in 1d then yes?. it has variable length. In 2d will the number of columns always be constant both for the input and output. >>>>>> >>>>>>> Den 21. jan. 2017 kl. 18.25 skrev Jacob Schreiber : >>>>>>> >>>>>>> If what you're saying is that you have a variable length input, then most sklearn classifiers won't work on this data. They expect a fixed feature set. Perhaps you could try extracting a set of informative features being fed into the classifier? >>>>>>> >>>>>>> On Sat, Jan 21, 2017 at 3:18 AM, Carlton Banks wrote: >>>>>>> Hi guys.. >>>>>>> >>>>>>> I am currently working on a ASR project in which the objective is to substitute part of the general ASR framework with some form of neural network, to see whether the tested part improves in any way. >>>>>>> >>>>>>> I started working with the feature extraction and tried, to make a neural network (NN) that could create MFCC features. I already know what the desired output is supposed to be, so the problem boils down to a simple >>>>>>> input - output mapping. Problem here is the my NN doesn?t seem to perform that well.. and i seem to get pretty large error for some reason. >>>>>>> >>>>>>> I therefore wanted to give random forrest a try, and see whether it could provide me a better result. >>>>>>> >>>>>>> I am currently storing my input and output in numpy.ndarrays, in which the input and output columns is consistent throughout all the examples, but the number of rows changes >>>>>>> depending on length of the audio file. >>>>>>> >>>>>>> Is it possible with the random forrest implementation in scikit-learn to train a random forrest to map an input an output, given they are stored numpy.ndarrays? >>>>>>> Or do i have do it in a different way? and if so how? 
>>>>>>> >>>>>>> kind regards >>>>>>> >>>>>>> Carl truz >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From cleverless at gmail.com Sat Jan 21 15:26:05 2017 From: cleverless at gmail.com (Josh Vredevoogd) Date: Sat, 21 Jan 2017 12:26:05 -0800 Subject: [scikit-learn] Query regarding parameter class_weight in Random Forest Classifier In-Reply-To: References: Message-ID: The class_weight parameter doesn't behave the way you're expecting. The value in class_weight is the weight applied to each sample in that class - in your example, each class zero sample has weight 0.001 and each class one sample has weight 0.999, so each class one samples carries 999 times the weight of a class zero sample. If you would like each class one sample to have ten times the weight, you would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` equivalently. On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh wrote: > Hi All, > Greetings ! > > I have a very basic question regarding the usage of the > parameter class_weight in scikit learn's Random Forest Classifier's fit > method. > > I have a fairly unbalanced sample and my positive class : > negative class ratio is 1:100. In other words, I have a million records > corresponding to negative class and 10,000 records corresponding to > positive class. I have trained the random forest classifier model using the > above record set successfully. > > Further, for a different problem, I want to test the > parameter class_weight. So, I am setting the class_weight as [0:0.001 , > 1:0.999] and I have tried running my model on the same dataset as mentioned > in the above paragraph but with the positive class records reduced to 1000 > [because now each positive class is given approximately 10 times more > weight than a negative class]. However, the model run results are very very > different between the 2 runs (with and without class_weight). And I > expected a similar run results. 
> > Would you please be able to let me know where am I getting > wrong. I know it's something silly but just want to improve on my concept. > > Thanks ! > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Sun Jan 22 08:00:31 2017 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Sun, 22 Jan 2017 18:30:31 +0530 Subject: [scikit-learn] Query regarding parameter class_weight in Random Forest Classifier In-Reply-To: References: Message-ID: Thanks Josh ! I have used the parameter class_weight={0: 1, 1: 10} and the model code has run successfully. However, just to get a further clarity around it's concept, I am having another question for you please. I did the following 2 tests: 1. In my dataset , I have 1 million negative classes and 10,000 positive classes. First I ran my model code without supplying any class_weight parameter and it gave me certain True Positive and False Positive results. 2. Now in the second test, I had the same 1 million negative classes but reduced the positive classes to 1000 . But this time, I supplied the parameter class_weight={0: 1, 1: 10} and got my True Positive and False Positive Results My question is , when I multiply the results obtained from my second test with a factor of 10, I don't match with the results obtained from my first test. In other words, say I get the true positive against a threshold from the second test as 8 , while the true positive from the first test against the same threshold is 260. I am getting similar observations for the false positive results wherein if I multiply the results obtained in the second test by 10, I don't come close to the results obtained from the first set. Is my expectation correct ? Is my way of executing the test (i.e., reducing the the positive classes by 10 times and then feeding a class weight of 10 times the negative classes) and comparing the results with a model run without any class weight parameter correct ? Please let me know as per your convenience as this will help me a big way to understand the concept further. Thanks in advance ! On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd wrote: > The class_weight parameter doesn't behave the way you're expecting. > > The value in class_weight is the weight applied to each sample in that > class - in your example, each class zero sample has weight 0.001 and each > class one sample has weight 0.999, so each class one samples carries 999 > times the weight of a class zero sample. > > If you would like each class one sample to have ten times the weight, you > would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` > equivalently. > > > On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh > wrote: > >> Hi All, >> Greetings ! >> >> I have a very basic question regarding the usage of the >> parameter class_weight in scikit learn's Random Forest Classifier's fit >> method. >> >> I have a fairly unbalanced sample and my positive class : >> negative class ratio is 1:100. In other words, I have a million records >> corresponding to negative class and 10,000 records corresponding to >> positive class. I have trained the random forest classifier model using the >> above record set successfully. >> >> Further, for a different problem, I want to test the >> parameter class_weight. 
So, I am setting the class_weight as [0:0.001 , >> 1:0.999] and I have tried running my model on the same dataset as mentioned >> in the above paragraph but with the positive class records reduced to 1000 >> [because now each positive class is given approximately 10 times more >> weight than a negative class]. However, the model run results are very very >> different between the 2 runs (with and without class_weight). And I >> expected a similar run results. >> >> Would you please be able to let me know where am I >> getting wrong. I know it's something silly but just want to improve on my >> concept. >> >> Thanks ! >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cleverless at gmail.com Sun Jan 22 23:26:23 2017 From: cleverless at gmail.com (Josh Vredevoogd) Date: Sun, 22 Jan 2017 20:26:23 -0800 Subject: [scikit-learn] Query regarding parameter class_weight in Random Forest Classifier In-Reply-To: References: Message-ID: If you undersample, taking only 10% of the negative class, the classifier will see different combinations of attributes and produce a different fit to explain those distributions. In the worse case, imagine you are classifying birds and through sampling you eliminate all `red` examples. Your classifier likely now will not understand that red objects can be birds. That's an overly simple example, but given a classifier capable of exploring and explaining feature combinations, less obvious versions of this are bound to happen. The extrapolation only works in the other direction: if you manually duplicate samples by the sampling factor, you should get the exact same fit as if you increased the class weight. Hope that helps, Josh On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh wrote: > Thanks Josh ! > > I have used the parameter class_weight={0: 1, 1: 10} and the model code > has run successfully. However, just to get a further clarity around it's > concept, I am having another question for you please. I did the following 2 > tests: > > 1. In my dataset , I have 1 million negative classes and 10,000 positive > classes. First I ran my model code without supplying any class_weight > parameter and it gave me certain True Positive and False Positive results. > > 2. Now in the second test, I had the same 1 million negative classes but > reduced the positive classes to 1000 . But this time, I supplied the > parameter class_weight={0: 1, 1: 10} and got my True Positive and False > Positive Results > > My question is , when I multiply the results obtained from my second test > with a factor of 10, I don't match with the results obtained from my first > test. In other words, say I get the true positive against a threshold from > the second test as 8 , while the true positive from the first test against > the same threshold is 260. I am getting similar observations for the false > positive results wherein if I multiply the results obtained in the second > test by 10, I don't come close to the results obtained from the first set. > > Is my expectation correct ? 
Is my way of executing the test (i.e., > reducing the the positive classes by 10 times and then feeding a class > weight of 10 times the negative classes) and comparing the results with a > model run without any class weight parameter correct ? > > Please let me know as per your convenience as this will help me a big way > to understand the concept further. > > Thanks in advance ! > > On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd > wrote: > >> The class_weight parameter doesn't behave the way you're expecting. >> >> The value in class_weight is the weight applied to each sample in that >> class - in your example, each class zero sample has weight 0.001 and each >> class one sample has weight 0.999, so each class one samples carries 999 >> times the weight of a class zero sample. >> >> If you would like each class one sample to have ten times the weight, you >> would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` >> equivalently. >> >> >> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh >> wrote: >> >>> Hi All, >>> Greetings ! >>> >>> I have a very basic question regarding the usage of the >>> parameter class_weight in scikit learn's Random Forest Classifier's fit >>> method. >>> >>> I have a fairly unbalanced sample and my positive class : >>> negative class ratio is 1:100. In other words, I have a million records >>> corresponding to negative class and 10,000 records corresponding to >>> positive class. I have trained the random forest classifier model using the >>> above record set successfully. >>> >>> Further, for a different problem, I want to test the >>> parameter class_weight. So, I am setting the class_weight as [0:0.001 , >>> 1:0.999] and I have tried running my model on the same dataset as mentioned >>> in the above paragraph but with the positive class records reduced to 1000 >>> [because now each positive class is given approximately 10 times more >>> weight than a negative class]. However, the model run results are very very >>> different between the 2 runs (with and without class_weight). And I >>> expected a similar run results. >>> >>> Would you please be able to let me know where am I >>> getting wrong. I know it's something silly but just want to improve on my >>> concept. >>> >>> Thanks ! >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Mon Jan 23 19:48:33 2017 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Tue, 24 Jan 2017 06:18:33 +0530 Subject: [scikit-learn] class_weight: How to assign a higher weightage to values in a specific column as opposed to values in another column Message-ID: Thanks Josh for your quick feedback ! It's quite helpful indeed . Further to it , I am having another burning question. In my sample dataset , I have 2 label columns (let's say x and y) My objective is to give the labels within column 'x' 10 times more weight as compared to labels within column y. 
My question is the parameter class_weight={0: 1, 1: 10} works for a single column, i.e., within a single column I have assigned 10 times weight to the positive labels. But my objective is to provide a 10 times weight to the positive labels within column 'x' as compared to the positive labels within column 'y'. May I please get a feedback from you around how to achieve this please. Thanks for your help in advance ! On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd wrote: > If you undersample, taking only 10% of the negative class, the classifier > will see different combinations of attributes and produce a different fit > to explain those distributions. In the worse case, imagine you are > classifying birds and through sampling you eliminate all `red` examples. > Your classifier likely now will not understand that red objects can be > birds. That's an overly simple example, but given a classifier capable of > exploring and explaining feature combinations, less obvious versions of > this are bound to happen. > > The extrapolation only works in the other direction: if you manually > duplicate samples by the sampling factor, you should get the exact same fit > as if you increased the class weight. > > Hope that helps, > Josh > > > On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh > wrote: > >> Thanks Josh ! >> >> I have used the parameter class_weight={0: 1, 1: 10} and the model code >> has run successfully. However, just to get a further clarity around it's >> concept, I am having another question for you please. I did the following 2 >> tests: >> >> 1. In my dataset , I have 1 million negative classes and 10,000 positive >> classes. First I ran my model code without supplying any class_weight >> parameter and it gave me certain True Positive and False Positive results. >> >> 2. Now in the second test, I had the same 1 million negative classes but >> reduced the positive classes to 1000 . But this time, I supplied the >> parameter class_weight={0: 1, 1: 10} and got my True Positive and False >> Positive Results >> >> My question is , when I multiply the results obtained from my second test >> with a factor of 10, I don't match with the results obtained from my first >> test. In other words, say I get the true positive against a threshold from >> the second test as 8 , while the true positive from the first test against >> the same threshold is 260. I am getting similar observations for the false >> positive results wherein if I multiply the results obtained in the second >> test by 10, I don't come close to the results obtained from the first set. >> >> Is my expectation correct ? Is my way of executing the test (i.e., >> reducing the the positive classes by 10 times and then feeding a class >> weight of 10 times the negative classes) and comparing the results with a >> model run without any class weight parameter correct ? >> >> Please let me know as per your convenience as this will help me a big way >> to understand the concept further. >> >> Thanks in advance ! >> >> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd >> wrote: >> >>> The class_weight parameter doesn't behave the way you're expecting. >>> >>> The value in class_weight is the weight applied to each sample in that >>> class - in your example, each class zero sample has weight 0.001 and each >>> class one sample has weight 0.999, so each class one samples carries 999 >>> times the weight of a class zero sample. 
>>> >>> If you would like each class one sample to have ten times the weight, >>> you would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` >>> equivalently. >>> >>> >>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh >> > wrote: >>> >>>> Hi All, >>>> Greetings ! >>>> >>>> I have a very basic question regarding the usage of the >>>> parameter class_weight in scikit learn's Random Forest Classifier's fit >>>> method. >>>> >>>> I have a fairly unbalanced sample and my positive class : >>>> negative class ratio is 1:100. In other words, I have a million records >>>> corresponding to negative class and 10,000 records corresponding to >>>> positive class. I have trained the random forest classifier model using the >>>> above record set successfully. >>>> >>>> Further, for a different problem, I want to test the >>>> parameter class_weight. So, I am setting the class_weight as [0:0.001 , >>>> 1:0.999] and I have tried running my model on the same dataset as mentioned >>>> in the above paragraph but with the positive class records reduced to 1000 >>>> [because now each positive class is given approximately 10 times more >>>> weight than a negative class]. However, the model run results are very very >>>> different between the 2 runs (with and without class_weight). And I >>>> expected a similar run results. >>>> >>>> Would you please be able to let me know where am I >>>> getting wrong. I know it's something silly but just want to improve on my >>>> concept. >>>> >>>> Thanks ! >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cleverless at gmail.com Mon Jan 23 20:28:18 2017 From: cleverless at gmail.com (Josh Vredevoogd) Date: Mon, 23 Jan 2017 17:28:18 -0800 Subject: [scikit-learn] class_weight: How to assign a higher weightage to values in a specific column as opposed to values in another column In-Reply-To: References: Message-ID: If you do not want the weights to be uniform by class, then you need to generate weights for each sample and pass the sample weight vector to the fit method of the classifier. On Mon, Jan 23, 2017 at 4:48 PM, Debabrata Ghosh wrote: > Thanks Josh for your quick feedback ! It's quite helpful indeed . > > Further to it , I am having another burning question. In my sample dataset > , I have 2 label columns (let's say x and y) > > My objective is to give the labels within column 'x' 10 times more weight > as compared to labels within column y. > > My question is the parameter class_weight={0: 1, 1: 10} works for a single > column, i.e., within a single column I have assigned 10 times weight to the > positive labels. > > But my objective is to provide a 10 times weight to the positive labels > within column 'x' as compared to the positive labels within column 'y'. 
> > May I please get a feedback from you around how to achieve this please. > Thanks for your help in advance ! > > On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd > wrote: > >> If you undersample, taking only 10% of the negative class, the classifier >> will see different combinations of attributes and produce a different fit >> to explain those distributions. In the worse case, imagine you are >> classifying birds and through sampling you eliminate all `red` examples. >> Your classifier likely now will not understand that red objects can be >> birds. That's an overly simple example, but given a classifier capable of >> exploring and explaining feature combinations, less obvious versions of >> this are bound to happen. >> >> The extrapolation only works in the other direction: if you manually >> duplicate samples by the sampling factor, you should get the exact same fit >> as if you increased the class weight. >> >> Hope that helps, >> Josh >> >> >> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh >> wrote: >> >>> Thanks Josh ! >>> >>> I have used the parameter class_weight={0: 1, 1: 10} and the model code >>> has run successfully. However, just to get a further clarity around it's >>> concept, I am having another question for you please. I did the following 2 >>> tests: >>> >>> 1. In my dataset , I have 1 million negative classes and 10,000 positive >>> classes. First I ran my model code without supplying any class_weight >>> parameter and it gave me certain True Positive and False Positive results. >>> >>> 2. Now in the second test, I had the same 1 million negative classes but >>> reduced the positive classes to 1000 . But this time, I supplied the >>> parameter class_weight={0: 1, 1: 10} and got my True Positive and False >>> Positive Results >>> >>> My question is , when I multiply the results obtained from my second >>> test with a factor of 10, I don't match with the results obtained from my >>> first test. In other words, say I get the true positive against a threshold >>> from the second test as 8 , while the true positive from the first test >>> against the same threshold is 260. I am getting similar observations for >>> the false positive results wherein if I multiply the results obtained in >>> the second test by 10, I don't come close to the results obtained from the >>> first set. >>> >>> Is my expectation correct ? Is my way of executing the test (i.e., >>> reducing the the positive classes by 10 times and then feeding a class >>> weight of 10 times the negative classes) and comparing the results with a >>> model run without any class weight parameter correct ? >>> >>> Please let me know as per your convenience as this will help me a big >>> way to understand the concept further. >>> >>> Thanks in advance ! >>> >>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd >>> wrote: >>> >>>> The class_weight parameter doesn't behave the way you're expecting. >>>> >>>> The value in class_weight is the weight applied to each sample in that >>>> class - in your example, each class zero sample has weight 0.001 and each >>>> class one sample has weight 0.999, so each class one samples carries 999 >>>> times the weight of a class zero sample. >>>> >>>> If you would like each class one sample to have ten times the weight, >>>> you would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` >>>> equivalently. >>>> >>>> >>>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh < >>>> mailfordebu at gmail.com> wrote: >>>> >>>>> Hi All, >>>>> Greetings ! 
>>>>> >>>>> I have a very basic question regarding the usage of the >>>>> parameter class_weight in scikit learn's Random Forest Classifier's fit >>>>> method. >>>>> >>>>> I have a fairly unbalanced sample and my positive class >>>>> : negative class ratio is 1:100. In other words, I have a million records >>>>> corresponding to negative class and 10,000 records corresponding to >>>>> positive class. I have trained the random forest classifier model using the >>>>> above record set successfully. >>>>> >>>>> Further, for a different problem, I want to test the >>>>> parameter class_weight. So, I am setting the class_weight as [0:0.001 , >>>>> 1:0.999] and I have tried running my model on the same dataset as mentioned >>>>> in the above paragraph but with the positive class records reduced to 1000 >>>>> [because now each positive class is given approximately 10 times more >>>>> weight than a negative class]. However, the model run results are very very >>>>> different between the 2 runs (with and without class_weight). And I >>>>> expected a similar run results. >>>>> >>>>> Would you please be able to let me know where am I >>>>> getting wrong. I know it's something silly but just want to improve on my >>>>> concept. >>>>> >>>>> Thanks ! >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Tue Jan 24 02:36:57 2017 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Tue, 24 Jan 2017 13:06:57 +0530 Subject: [scikit-learn] class_weight: How to assign a higher weightage to values in a specific column as opposed to values in another column In-Reply-To: References: Message-ID: What would be the sample command for achieving it ? Sorry a bit new in this area and that's why I will be better able to understand it through certain example commands . Thanks again ! On Tue, Jan 24, 2017 at 6:58 AM, Josh Vredevoogd wrote: > If you do not want the weights to be uniform by class, then you need to > generate weights for each sample and pass the sample weight vector to the > fit method of the classifier. > > On Mon, Jan 23, 2017 at 4:48 PM, Debabrata Ghosh > wrote: > >> Thanks Josh for your quick feedback ! It's quite helpful indeed . >> >> Further to it , I am having another burning question. In my sample >> dataset , I have 2 label columns (let's say x and y) >> >> My objective is to give the labels within column 'x' 10 times more weight >> as compared to labels within column y. 
>> >> My question is the parameter class_weight={0: 1, 1: 10} works for a >> single column, i.e., within a single column I have assigned 10 times weight >> to the positive labels. >> >> But my objective is to provide a 10 times weight to the positive labels >> within column 'x' as compared to the positive labels within column 'y'. >> >> May I please get a feedback from you around how to achieve this please. >> Thanks for your help in advance ! >> >> On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd >> wrote: >> >>> If you undersample, taking only 10% of the negative class, the >>> classifier will see different combinations of attributes and produce a >>> different fit to explain those distributions. In the worse case, imagine >>> you are classifying birds and through sampling you eliminate all `red` >>> examples. Your classifier likely now will not understand that red objects >>> can be birds. That's an overly simple example, but given a classifier >>> capable of exploring and explaining feature combinations, less obvious >>> versions of this are bound to happen. >>> >>> The extrapolation only works in the other direction: if you manually >>> duplicate samples by the sampling factor, you should get the exact same fit >>> as if you increased the class weight. >>> >>> Hope that helps, >>> Josh >>> >>> >>> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh >>> wrote: >>> >>>> Thanks Josh ! >>>> >>>> I have used the parameter class_weight={0: 1, 1: 10} and the model code >>>> has run successfully. However, just to get a further clarity around it's >>>> concept, I am having another question for you please. I did the following 2 >>>> tests: >>>> >>>> 1. In my dataset , I have 1 million negative classes and 10,000 >>>> positive classes. First I ran my model code without supplying any >>>> class_weight parameter and it gave me certain True Positive and False >>>> Positive results. >>>> >>>> 2. Now in the second test, I had the same 1 million negative classes >>>> but reduced the positive classes to 1000 . But this time, I supplied the >>>> parameter class_weight={0: 1, 1: 10} and got my True Positive and False >>>> Positive Results >>>> >>>> My question is , when I multiply the results obtained from my second >>>> test with a factor of 10, I don't match with the results obtained from my >>>> first test. In other words, say I get the true positive against a threshold >>>> from the second test as 8 , while the true positive from the first test >>>> against the same threshold is 260. I am getting similar observations for >>>> the false positive results wherein if I multiply the results obtained in >>>> the second test by 10, I don't come close to the results obtained from the >>>> first set. >>>> >>>> Is my expectation correct ? Is my way of executing the test (i.e., >>>> reducing the the positive classes by 10 times and then feeding a class >>>> weight of 10 times the negative classes) and comparing the results with a >>>> model run without any class weight parameter correct ? >>>> >>>> Please let me know as per your convenience as this will help me a big >>>> way to understand the concept further. >>>> >>>> Thanks in advance ! >>>> >>>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd >>>> wrote: >>>> >>>>> The class_weight parameter doesn't behave the way you're expecting. 
>>>>> >>>>> The value in class_weight is the weight applied to each sample in that >>>>> class - in your example, each class zero sample has weight 0.001 and each >>>>> class one sample has weight 0.999, so each class one samples carries 999 >>>>> times the weight of a class zero sample. >>>>> >>>>> If you would like each class one sample to have ten times the weight, >>>>> you would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` >>>>> equivalently. >>>>> >>>>> >>>>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh < >>>>> mailfordebu at gmail.com> wrote: >>>>> >>>>>> Hi All, >>>>>> Greetings ! >>>>>> >>>>>> I have a very basic question regarding the usage of the >>>>>> parameter class_weight in scikit learn's Random Forest Classifier's fit >>>>>> method. >>>>>> >>>>>> I have a fairly unbalanced sample and my positive class >>>>>> : negative class ratio is 1:100. In other words, I have a million records >>>>>> corresponding to negative class and 10,000 records corresponding to >>>>>> positive class. I have trained the random forest classifier model using the >>>>>> above record set successfully. >>>>>> >>>>>> Further, for a different problem, I want to test the >>>>>> parameter class_weight. So, I am setting the class_weight as [0:0.001 , >>>>>> 1:0.999] and I have tried running my model on the same dataset as mentioned >>>>>> in the above paragraph but with the positive class records reduced to 1000 >>>>>> [because now each positive class is given approximately 10 times more >>>>>> weight than a negative class]. However, the model run results are very very >>>>>> different between the 2 runs (with and without class_weight). And I >>>>>> expected a similar run results. >>>>>> >>>>>> Would you please be able to let me know where am I >>>>>> getting wrong. I know it's something silly but just want to improve on my >>>>>> concept. >>>>>> >>>>>> Thanks ! >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From naopon at gmail.com Tue Jan 24 02:51:43 2017 From: naopon at gmail.com (Naoya Kanai) Date: Mon, 23 Jan 2017 23:51:43 -0800 Subject: [scikit-learn] class_weight: How to assign a higher weightage to values in a specific column as opposed to values in another column In-Reply-To: References: Message-ID: You need to write your own function to compute a vector assigning a weight to each sample in X, then pass that as sample_weight parameter on RandomForestClassifier.fit() . If you also use class_weight on the model constructor, class_weight and sample_weight are multiplied through for each sample. On Mon, Jan 23, 2017 at 11:36 PM, Debabrata Ghosh wrote: > What would be the sample command for achieving it ? Sorry a bit new in > this area and that's why I will be better able to understand it through > certain example commands . > > Thanks again ! > > On Tue, Jan 24, 2017 at 6:58 AM, Josh Vredevoogd > wrote: > >> If you do not want the weights to be uniform by class, then you need to >> generate weights for each sample and pass the sample weight vector to the >> fit method of the classifier. >> >> On Mon, Jan 23, 2017 at 4:48 PM, Debabrata Ghosh >> wrote: >> >>> Thanks Josh for your quick feedback ! It's quite helpful indeed . >>> >>> Further to it , I am having another burning question. In my sample >>> dataset , I have 2 label columns (let's say x and y) >>> >>> My objective is to give the labels within column 'x' 10 times more >>> weight as compared to labels within column y. >>> >>> My question is the parameter class_weight={0: 1, 1: 10} works for a >>> single column, i.e., within a single column I have assigned 10 times weight >>> to the positive labels. >>> >>> But my objective is to provide a 10 times weight to the positive labels >>> within column 'x' as compared to the positive labels within column 'y'. >>> >>> May I please get a feedback from you around how to achieve this please. >>> Thanks for your help in advance ! >>> >>> On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd >>> wrote: >>> >>>> If you undersample, taking only 10% of the negative class, the >>>> classifier will see different combinations of attributes and produce a >>>> different fit to explain those distributions. In the worse case, imagine >>>> you are classifying birds and through sampling you eliminate all `red` >>>> examples. Your classifier likely now will not understand that red objects >>>> can be birds. That's an overly simple example, but given a classifier >>>> capable of exploring and explaining feature combinations, less obvious >>>> versions of this are bound to happen. >>>> >>>> The extrapolation only works in the other direction: if you manually >>>> duplicate samples by the sampling factor, you should get the exact same fit >>>> as if you increased the class weight. >>>> >>>> Hope that helps, >>>> Josh >>>> >>>> >>>> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh >>> > wrote: >>>> >>>>> Thanks Josh ! >>>>> >>>>> I have used the parameter class_weight={0: 1, 1: 10} and the model >>>>> code has run successfully. However, just to get a further clarity around >>>>> it's concept, I am having another question for you please. I did the >>>>> following 2 tests: >>>>> >>>>> 1. In my dataset , I have 1 million negative classes and 10,000 >>>>> positive classes. First I ran my model code without supplying any >>>>> class_weight parameter and it gave me certain True Positive and False >>>>> Positive results. >>>>> >>>>> 2. 
Now in the second test, I had the same 1 million negative classes >>>>> but reduced the positive classes to 1000 . But this time, I supplied the >>>>> parameter class_weight={0: 1, 1: 10} and got my True Positive and False >>>>> Positive Results >>>>> >>>>> My question is , when I multiply the results obtained from my second >>>>> test with a factor of 10, I don't match with the results obtained from my >>>>> first test. In other words, say I get the true positive against a threshold >>>>> from the second test as 8 , while the true positive from the first test >>>>> against the same threshold is 260. I am getting similar observations for >>>>> the false positive results wherein if I multiply the results obtained in >>>>> the second test by 10, I don't come close to the results obtained from the >>>>> first set. >>>>> >>>>> Is my expectation correct ? Is my way of executing the test (i.e., >>>>> reducing the the positive classes by 10 times and then feeding a class >>>>> weight of 10 times the negative classes) and comparing the results with a >>>>> model run without any class weight parameter correct ? >>>>> >>>>> Please let me know as per your convenience as this will help me a big >>>>> way to understand the concept further. >>>>> >>>>> Thanks in advance ! >>>>> >>>>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd >>>> > wrote: >>>>> >>>>>> The class_weight parameter doesn't behave the way you're expecting. >>>>>> >>>>>> The value in class_weight is the weight applied to each sample in >>>>>> that class - in your example, each class zero sample has weight 0.001 and >>>>>> each class one sample has weight 0.999, so each class one samples carries >>>>>> 999 times the weight of a class zero sample. >>>>>> >>>>>> If you would like each class one sample to have ten times the weight, >>>>>> you would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}` >>>>>> equivalently. >>>>>> >>>>>> >>>>>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh < >>>>>> mailfordebu at gmail.com> wrote: >>>>>> >>>>>>> Hi All, >>>>>>> Greetings ! >>>>>>> >>>>>>> I have a very basic question regarding the usage of >>>>>>> the parameter class_weight in scikit learn's Random Forest Classifier's fit >>>>>>> method. >>>>>>> >>>>>>> I have a fairly unbalanced sample and my positive >>>>>>> class : negative class ratio is 1:100. In other words, I have a million >>>>>>> records corresponding to negative class and 10,000 records corresponding to >>>>>>> positive class. I have trained the random forest classifier model using the >>>>>>> above record set successfully. >>>>>>> >>>>>>> Further, for a different problem, I want to test the >>>>>>> parameter class_weight. So, I am setting the class_weight as [0:0.001 , >>>>>>> 1:0.999] and I have tried running my model on the same dataset as mentioned >>>>>>> in the above paragraph but with the positive class records reduced to 1000 >>>>>>> [because now each positive class is given approximately 10 times more >>>>>>> weight than a negative class]. However, the model run results are very very >>>>>>> different between the 2 runs (with and without class_weight). And I >>>>>>> expected a similar run results. >>>>>>> >>>>>>> Would you please be able to let me know where am I >>>>>>> getting wrong. I know it's something silly but just want to improve on my >>>>>>> concept. >>>>>>> >>>>>>> Thanks ! 
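On the point above that class_weight and sample_weight are multiplied through for each sample, a rough hand computation outside the estimator may help; all numbers below are made up for illustration:

import numpy as np

y = np.array([0, 0, 1, 1])
sample_weight = np.array([1.0, 2.0, 1.0, 2.0])  # illustrative per-sample weights
class_weight = {0: 1, 1: 10}

# class_weight is expanded to one weight per sample and multiplied into sample_weight,
# so the effective weight of each sample is the product of the two.
effective = sample_weight * np.where(y == 1, class_weight[1], class_weight[0])
print(effective)  # expected: [ 1.  2. 10. 20.]
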
>>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Jan 26 10:51:54 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 26 Jan 2017 10:51:54 -0500 Subject: [scikit-learn] (personal) Survey for future scikit-learn development Message-ID: Hey all. I created a survey to prioritize and justify (to people that give me money) future scikit-learn development. It would be great if you could answer it, it should be pretty sort (it's 10 questions, mostly multiple choice). Feel free to share, more replies are better ;) https://www.surveymonkey.com/r/GJFK32S Disclaimer: While I will share the results with the project (and anyone that cares), I want to make clear that that this is not an "official scikit-learn survey" in that it wasn't designed or endorsed by the whole team. Also, the results are not binding for anyone, though I will use them to steer my work and projects, and I think the rest of the project will certainly take the input into account. Thanks! Andy From raga.markely at gmail.com Thu Jan 26 11:02:48 2017 From: raga.markely at gmail.com (Raga Markely) Date: Thu, 26 Jan 2017 11:02:48 -0500 Subject: [scikit-learn] Scores in Cross Validation Message-ID: Hello, I have 2 questions regarding cross_val_score. 1. Do the scores returned by cross_val_score correspond to only the test set or the whole data set (training and test sets)? I tried to look at the source code, and it looks like it returns the score of only the test set (line 145: "return_train_score=False") - I am not sure if I am reading the codes properly, though.. https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/model_ selection/_validation.py#L36 I came across the paper below and the authors use the score of the whole dataset when the author performs repeated nested loop, grid search cv, etc.. e.g. see algorithm 1 (line 1c) and 2 (line 2d) on page 3. https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10 I wonder what's the pros and cons of using the accuracy score of the whole dataset vs just the test set.. any thoughts? 2. On line 283 of the cross_val_score source code, there is a function _score. 
However, I can't find where this function is called. Could you let me know where this function is called? Thank you very much! Raga -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Jan 26 12:05:12 2017 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Thu, 26 Jan 2017 18:05:12 +0100 Subject: [scikit-learn] Scores in Cross Validation In-Reply-To: References: Message-ID: 1. You should not evaluate an estimator on the data which have been used to train it. Usually, you try to minimize the classification or loss using those data and fit them as good as possible. Evaluating on an unseen testing set will give you an idea how good your estimator was able to generalize to your problem during the training. Furthermore, a training, validation, and testing set should be used when setting up parameters. Validation will be used to set the parameters and the testing will be used to evaluate your best estimator. That is why, when using the GridSearchCV, fit will train the estimator using a training and validation test (using a given CV startegies). Finally, predict will be performed on another unseen testing set. The bottom line is that using training data to select parameters will not ensure that you are selecting the best parameters for your problems. 2. The function is call in _fit_and_score, l. 260 and 263 for instance. On 26 January 2017 at 17:02, Raga Markely wrote: > Hello, > > I have 2 questions regarding cross_val_score. > 1. Do the scores returned by cross_val_score correspond to only the test > set or the whole data set (training and test sets)? > I tried to look at the source code, and it looks like it returns the score > of only the test set (line 145: "return_train_score=False") - I am not sure > if I am reading the codes properly, though.. > https://github.com/scikit-learn/scikit-learn/blob/14031f6/ > sklearn/model_selection/_validation.py#L36 > I came across the paper below and the authors use the score of the whole > dataset when the author performs repeated nested loop, grid search cv, > etc.. e.g. see algorithm 1 (line 1c) and 2 (line 2d) on page 3. > https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10 > I wonder what's the pros and cons of using the accuracy score of the whole > dataset vs just the test set.. any thoughts? > > 2. On line 283 of the cross_val_score source code, there is a function > _score. However, I can't find where this function is called. Could you let > me know where this function is called? > > Thank you very much! > Raga > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Thu Jan 26 13:19:39 2017 From: raga.markely at gmail.com (Raga Markely) Date: Thu, 26 Jan 2017 13:19:39 -0500 Subject: [scikit-learn] Scores in Cross Validation In-Reply-To: References: Message-ID: Thank you, Guillaume. 1. I agree with you - that's what I have been learning and makes sense.. I was a bit surprised when I read the paper today.. 2. Ah.. thank you.. 
I got to change my glasses :P Best, Raga *Guillaume Lema?tre* g.lemaitre58 at gmail.com *Thu Jan 26 12:05:12 EST 2017* - Previous message (by thread): [scikit-learn] Scores in Cross Validation - *Messages sorted by:* [ date ] [ thread ] [ subject ] [ author ] ------------------------------ 1. You should not evaluate an estimator on the data which have been used to train it. Usually, you try to minimize the classification or loss using those data and fit them as good as possible. Evaluating on an unseen testing set will give you an idea how good your estimator was able to generalize to your problem during the training. Furthermore, a training, validation, and testing set should be used when setting up parameters. Validation will be used to set the parameters and the testing will be used to evaluate your best estimator. That is why, when using the GridSearchCV, fit will train the estimator using a training and validation test (using a given CV startegies). Finally, predict will be performed on another unseen testing set. The bottom line is that using training data to select parameters will not ensure that you are selecting the best parameters for your problems. 2. The function is call in _fit_and_score, l. 260 and 263 for instance. On 26 January 2017 at 17:02, Raga Markely > wrote: >* Hello, *>>* I have 2 questions regarding cross_val_score. *>* 1. Do the scores returned by cross_val_score correspond to only the test *>* set or the whole data set (training and test sets)? *>* I tried to look at the source code, and it looks like it returns the score *>* of only the test set (line 145: "return_train_score=False") - I am not sure *>* if I am reading the codes properly, though.. *>* https://github.com/scikit-learn/scikit-learn/blob/14031f6/ *>* sklearn/model_selection/_validation.py#L36 *>* I came across the paper below and the authors use the score of the whole *>* dataset when the author performs repeated nested loop, grid search cv, *>* etc.. e.g. see algorithm 1 (line 1c) and 2 (line 2d) on page 3. *>* https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10 *>* I wonder what's the pros and cons of using the accuracy score of the whole *>* dataset vs just the test set.. any thoughts? *>>* 2. On line 283 of the cross_val_score source code, there is a function *>* _score. However, I can't find where this function is called. Could you let *>* me know where this function is called? *>>* Thank you very much! *>* Raga *>>* _______________________________________________ *>* scikit-learn mailing list *>* scikit-learn at python.org *>* https://mail.python.org/mailman/listinfo/scikit-learn *>> -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETALguillaume.lemaitre at inria.f >r ---https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jeremiah.Johnson at unh.edu Thu Jan 26 15:27:34 2017 From: Jeremiah.Johnson at unh.edu (Johnson, Jeremiah) Date: Thu, 26 Jan 2017 20:27:34 +0000 Subject: [scikit-learn] top N accuracy classification metric In-Reply-To: References: , Message-ID: <1485462271395.2206@unh.edu> Okay, I didn't see anything equivalent in the issue tracker, so submitted a pull request. Jeremiah =============================== Jeremiah W. Johnson, Ph. 
D Assistant Professor of Data Science Analytics Bachelor of Science Program Coordinator University of New Hampshire http://linkedin.com/jwjohnson314 ________________________________ From: scikit-learn on behalf of Joel Nothman Sent: Saturday, January 21, 2017 5:52 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] top N accuracy classification metric There are metrics with that kind of input in sklearn.metrics.ranking. I don't have the time to look them up now, but there have been proposals and PRs for similar ranking metrics. Please search the issue tracker for related issues. Thanks, Joel On 21 January 2017 at 06:16, Johnson, Jeremiah > wrote: Hi all, It's common to use a top-n accuracy metric for multi-class classification problems, where for each observation the prediction is the set of probabilities for each of the classes, and a prediction is top-N accurate if the correct class is among the N highest predicted probability classes. I've written a simple implementation, but I don't think it quite fits the sklearn api. Specifically, _check_targets objects to the the continuous-multioutput format of the predictions for a classification task. Is there any interest in including a metric like this? I'd be happy to submit a pull request. Jeremiah _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Thu Jan 26 17:39:41 2017 From: raga.markely at gmail.com (Raga Markely) Date: Thu, 26 Jan 2017 17:39:41 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV Message-ID: Hello, I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in different splits would be different. However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated in identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations. If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has random state; I wonder if there is anyway I can make the the training and test sets randomly separated each time I call the GridSearchCV? Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifiers. Thank you very much! Raga -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Jan 26 18:34:25 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 26 Jan 2017 18:34:25 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: Message-ID: Hi, Raga, I think that if GridSearchCV is used for classification, the stratified k-fold doesn?t do shuffling by default. Say you do 20 grid search repetitions, you could then do sth like: from sklearn.model_selection import StratifiedKFold for i in range(n_reps): k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) gs = GridSearchCV(..., cv=k_fold) ... 
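Filled in with placeholders so it runs end to end (the dataset, estimator, and grid below are illustrative only; the point is that shuffle=True plus a different random_state per repetition gives differently shuffled stratified folds):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0]}

for i in range(20):
    # a varying random_state makes each repetition use different folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
    gs.fit(X, y)
    print(i, gs.best_params_, round(gs.best_score_, 3))
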
Best, Sebastian > On Jan 26, 2017, at 5:39 PM, Raga Markely wrote: > > Hello, > > I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in different splits would be different. > > However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated in identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations. > > If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has random state; I wonder if there is anyway I can make the the training and test sets randomly separated each time I call the GridSearchCV? > > Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifiers. > > Thank you very much! > Raga > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Thu Jan 26 18:37:25 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 26 Jan 2017 18:37:25 -0500 Subject: [scikit-learn] Scores in Cross Validation In-Reply-To: References: Message-ID: <83184497-0DEB-44B8-9713-E9B87DBDE4F3@gmail.com> > Furthermore, a training, validation, and testing set should be used when > setting up > parameters. Usually, it?s better to use a train set and separate test set, and do model selection via k-fold on the training set. Then, you do the final model estimation on the test set that you haven?t touched before. I often use ?training, validation, and testing ? approach as well, though, especially when working with large datasets and for early stopping on neural nets. Best, Sebastian > On Jan 26, 2017, at 1:19 PM, Raga Markely wrote: > > Thank you, Guillaume. > > 1. I agree with you - that's what I have been learning and makes sense.. I was a bit surprised when I read the paper today.. > > 2. Ah.. thank you.. I got to change my glasses :P > > Best, > Raga > > Guillaume Lema?tre g.lemaitre58 at gmail.com > Thu Jan 26 12:05:12 EST 2017 > > ? Previous message (by thread): [scikit-learn] Scores in Cross Validation > ? Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] > 1. You should not evaluate an estimator on the data which have been used to > train it. > Usually, you try to minimize the classification or loss using those data > and fit them as > good as possible. Evaluating on an unseen testing set will give you an idea > how good > your estimator was able to generalize to your problem during the training. > Furthermore, a training, validation, and testing set should be used when > setting up > parameters. Validation will be used to set the parameters and the testing > will be used > to evaluate your best estimator. > > That is why, when using the GridSearchCV, fit will train the estimator > using a training > and validation test (using a given CV startegies). Finally, predict will be > performed on > another unseen testing set. > > The bottom line is that using training data to select parameters will not > ensure that you > are selecting the best parameters for your problems. > > 2. The function is call in _fit_and_score, l. 
260 and 263 for instance. > > On 26 January 2017 at 17:02, Raga Markely < > raga.markely at gmail.com > > wrote: > > > > Hello, > > > > > > I have 2 questions regarding cross_val_score. > > > > 1. Do the scores returned by cross_val_score correspond to only the test > > > > set or the whole data set (training and test sets)? > > > > I tried to look at the source code, and it looks like it returns the score > > > > of only the test set (line 145: "return_train_score=False") - I am not sure > > > > if I am reading the codes properly, though.. > > > https://github.com/scikit-learn/scikit-learn/blob/14031f6/ > > > sklearn/model_selection/_validation.py#L36 > > > > I came across the paper below and the authors use the score of the whole > > > > dataset when the author performs repeated nested loop, grid search cv, > > > > etc.. e.g. see algorithm 1 (line 1c) and 2 (line 2d) on page 3. > > > https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10 > > > I wonder what's the pros and cons of using the accuracy score of the whole > > > > dataset vs just the test set.. any thoughts? > > > > > > 2. On line 283 of the cross_val_score source code, there is a function > > > > _score. However, I can't find where this function is called. Could you let > > > > me know where this function is called? > > > > > > Thank you very much! > > > > Raga > > > > > > _______________________________________________ > > > > scikit-learn mailing list > > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > > guillaume.lemaitre at inria.f >r --- > > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From g.lemaitre58 at gmail.com Thu Jan 26 18:41:48 2017 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Fri, 27 Jan 2017 00:41:48 +0100 Subject: [scikit-learn] Scores in Cross Validation In-Reply-To: <83184497-0DEB-44B8-9713-E9B87DBDE4F3@gmail.com> References: <83184497-0DEB-44B8-9713-E9B87DBDE4F3@gmail.com> Message-ID: I didn't express myself well but I was meaning: > model selection via k-fold on the training set for the training/validation set :D On 27 January 2017 at 00:37, Sebastian Raschka wrote: > > Furthermore, a training, validation, and testing set should be used when > > setting up > > parameters. > > Usually, it?s better to use a train set and separate test set, and do > model selection via k-fold on the training set. Then, you do the final > model estimation on the test set that you haven?t touched before. I often > use ?training, validation, and testing ? approach as well, though, > especially when working with large datasets and for early stopping on > neural nets. > > Best, > Sebastian > > > > On Jan 26, 2017, at 1:19 PM, Raga Markely > wrote: > > > > Thank you, Guillaume. > > > > 1. I agree with you - that's what I have been learning and makes sense.. > I was a bit surprised when I read the paper today.. > > > > 2. Ah.. thank you.. I got to change my glasses :P > > > > Best, > > Raga > > > > Guillaume Lema?tre g.lemaitre58 at gmail.com > > Thu Jan 26 12:05:12 EST 2017 > > > > ? Previous message (by thread): [scikit-learn] Scores in Cross > Validation > > ? Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] > > 1. You should not evaluate an estimator on the data which have been used > to > > train it. 
> > Usually, you try to minimize the classification or loss using those data > > and fit them as > > good as possible. Evaluating on an unseen testing set will give you an > idea > > how good > > your estimator was able to generalize to your problem during the > training. > > Furthermore, a training, validation, and testing set should be used when > > setting up > > parameters. Validation will be used to set the parameters and the testing > > will be used > > to evaluate your best estimator. > > > > That is why, when using the GridSearchCV, fit will train the estimator > > using a training > > and validation test (using a given CV startegies). Finally, predict will > be > > performed on > > another unseen testing set. > > > > The bottom line is that using training data to select parameters will not > > ensure that you > > are selecting the best parameters for your problems. > > > > 2. The function is call in _fit_and_score, l. 260 and 263 for instance. > > > > On 26 January 2017 at 17:02, Raga Markely < > > raga.markely at gmail.com > > > wrote: > > > > > > > Hello, > > > > > > > > > > I have 2 questions regarding cross_val_score. > > > > > > > 1. Do the scores returned by cross_val_score correspond to only the test > > > > > > > set or the whole data set (training and test sets)? > > > > > > > I tried to look at the source code, and it looks like it returns the > score > > > > > > > of only the test set (line 145: "return_train_score=False") - I am not > sure > > > > > > > if I am reading the codes properly, though.. > > > > > https://github.com/scikit-learn/scikit-learn/blob/14031f6/ > > > > > sklearn/model_selection/_validation.py#L36 > > > > > > > I came across the paper below and the authors use the score of the whole > > > > > > > dataset when the author performs repeated nested loop, grid search cv, > > > > > > > etc.. e.g. see algorithm 1 (line 1c) and 2 (line 2d) on page 3. > > > > > https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10 > > > > > I wonder what's the pros and cons of using the accuracy score of the > whole > > > > > > > dataset vs just the test set.. any thoughts? > > > > > > > > > > 2. On line 283 of the cross_val_score source code, there is a function > > > > > > > _score. However, I can't find where this function is called. Could you > let > > > > > > > me know where this function is called? > > > > > > > > > > Thank you very much! > > > > > > > Raga > > > > > > > > > > _______________________________________________ > > > > > > > scikit-learn mailing list > > > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > -- > > Guillaume Lemaitre > > INRIA Saclay - Ile-de-France > > Equipe PARIETAL > > > > guillaume.lemaitre at inria.f > >r --- > > > > https://glemaitre.github.io/ > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From raga.markely at gmail.com Thu Jan 26 20:06:06 2017 From: raga.markely at gmail.com (Raga Markely) Date: Thu, 26 Jan 2017 20:06:06 -0500 Subject: [scikit-learn] Scores in Cross Validation In-Reply-To: References: <83184497-0DEB-44B8-9713-E9B87DBDE4F3@gmail.com> Message-ID: Got it.. thank you for the clarification, Sebastian & Guillaume.. appreciate it! Best, Raga On Thu, Jan 26, 2017 at 6:41 PM, Guillaume Lema?tre wrote: > I didn't express myself well but I was meaning: > > > model selection via k-fold on the training set > > for the training/validation set :D > > On 27 January 2017 at 00:37, Sebastian Raschka > wrote: > >> > Furthermore, a training, validation, and testing set should be used when >> > setting up >> > parameters. >> >> Usually, it?s better to use a train set and separate test set, and do >> model selection via k-fold on the training set. Then, you do the final >> model estimation on the test set that you haven?t touched before. I often >> use ?training, validation, and testing ? approach as well, though, >> especially when working with large datasets and for early stopping on >> neural nets. >> >> Best, >> Sebastian >> >> >> > On Jan 26, 2017, at 1:19 PM, Raga Markely >> wrote: >> > >> > Thank you, Guillaume. >> > >> > 1. I agree with you - that's what I have been learning and makes >> sense.. I was a bit surprised when I read the paper today.. >> > >> > 2. Ah.. thank you.. I got to change my glasses :P >> > >> > Best, >> > Raga >> > >> > Guillaume Lema?tre g.lemaitre58 at gmail.com >> > Thu Jan 26 12:05:12 EST 2017 >> > >> > ? Previous message (by thread): [scikit-learn] Scores in Cross >> Validation >> > ? Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] >> > 1. You should not evaluate an estimator on the data which have been >> used to >> > train it. >> > Usually, you try to minimize the classification or loss using those data >> > and fit them as >> > good as possible. Evaluating on an unseen testing set will give you an >> idea >> > how good >> > your estimator was able to generalize to your problem during the >> training. >> > Furthermore, a training, validation, and testing set should be used when >> > setting up >> > parameters. Validation will be used to set the parameters and the >> testing >> > will be used >> > to evaluate your best estimator. >> > >> > That is why, when using the GridSearchCV, fit will train the estimator >> > using a training >> > and validation test (using a given CV startegies). Finally, predict >> will be >> > performed on >> > another unseen testing set. >> > >> > The bottom line is that using training data to select parameters will >> not >> > ensure that you >> > are selecting the best parameters for your problems. >> > >> > 2. The function is call in _fit_and_score, l. 260 and 263 for instance. >> > >> > On 26 January 2017 at 17:02, Raga Markely < >> > raga.markely at gmail.com >> > > wrote: >> > >> > > >> > Hello, >> > >> > > >> > > >> > I have 2 questions regarding cross_val_score. >> > >> > > >> > 1. Do the scores returned by cross_val_score correspond to only the >> test >> > >> > > >> > set or the whole data set (training and test sets)? >> > >> > > >> > I tried to look at the source code, and it looks like it returns the >> score >> > >> > > >> > of only the test set (line 145: "return_train_score=False") - I am not >> sure >> > >> > > >> > if I am reading the codes properly, though.. 
>> > >> > > https://github.com/scikit-learn/scikit-learn/blob/14031f6/ >> > > >> > sklearn/model_selection/_validation.py#L36 >> > >> > > >> > I came across the paper below and the authors use the score of the >> whole >> > >> > > >> > dataset when the author performs repeated nested loop, grid search cv, >> > >> > > >> > etc.. e.g. see algorithm 1 (line 1c) and 2 (line 2d) on page 3. >> > >> > > https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10 >> > > >> > I wonder what's the pros and cons of using the accuracy score of the >> whole >> > >> > > >> > dataset vs just the test set.. any thoughts? >> > >> > > >> > > >> > 2. On line 283 of the cross_val_score source code, there is a function >> > >> > > >> > _score. However, I can't find where this function is called. Could you >> let >> > >> > > >> > me know where this function is called? >> > >> > > >> > > >> > Thank you very much! >> > >> > > >> > Raga >> > >> > > >> > > >> > _______________________________________________ >> > >> > > >> > scikit-learn mailing list >> > >> > > scikit-learn at python.org >> > > https://mail.python.org/mailman/listinfo/scikit-learn >> > > >> > > >> > >> > >> > -- >> > Guillaume Lemaitre >> > INRIA Saclay - Ile-de-France >> > Equipe PARIETAL >> > >> > guillaume.lemaitre at inria.f > > >r --- >> > >> > https://glemaitre.github.io/ >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r --- > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Thu Jan 26 20:09:52 2017 From: raga.markely at gmail.com (Raga Markely) Date: Thu, 26 Jan 2017 20:09:52 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: Message-ID: Ahh.. nice.. I will use that.. thanks a lot, Sebastian! Best, Raga On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka wrote: > Hi, Raga, > > I think that if GridSearchCV is used for classification, the stratified > k-fold doesn?t do shuffling by default. > > Say you do 20 grid search repetitions, you could then do sth like: > > > from sklearn.model_selection import StratifiedKFold > > for i in range(n_reps): > k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > gs = GridSearchCV(..., cv=k_fold) > ... > > Best, > Sebastian > > > On Jan 26, 2017, at 5:39 PM, Raga Markely > wrote: > > > > Hello, > > > > I was trying to do repeated Grid Search CV (20 repeats). I thought that > each time I call GridSearchCV, the training and test sets separated in > different splits would be different. > > > > However, I got the same best_params_ and best_scores_ for all 20 > repeats. It looks like the training and test sets are separated in > identical folds in each run? Just to clarify, e.g. I have the following > data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = > 2. 
The split is always for instance [0,3] [1,2,4] in each repeat, and I > couldn't get [1,3] [0,2,4] or other combinations. > > > > If I understand correctly, GridSearchCV uses StratifiedKFold when I > enter cv = integer. The StratifiedKFold command has random state; I wonder > if there is anyway I can make the the training and test sets randomly > separated each time I call the GridSearchCV? > > > > Just a note, I used the following classifiers: Logistic Regression, KNN, > SVC, Kernel SVC, Random Forest, and had the same observation regardless of > the classifiers. > > > > Thank you very much! > > Raga > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Jan 26 20:31:20 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 26 Jan 2017 20:31:20 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: Message-ID: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions a) don?t do it based on your independent test set if this is going to your final model performance estimate, or be aware that it would be overly optimistic b) also, it?s not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g,. http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) But yeah, it all depends on your dataset and size. If you have a neural net that takes week to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I?d train on train/validation splits and evaluate on the test set. And to compare e.g., two networks against each other on large test sets, you could do a McNemar test. Best, Sebastian > On Jan 26, 2017, at 8:09 PM, Raga Markely wrote: > > Ahh.. nice.. I will use that.. thanks a lot, Sebastian! > > Best, > Raga > > On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka wrote: > Hi, Raga, > > I think that if GridSearchCV is used for classification, the stratified k-fold doesn?t do shuffling by default. > > Say you do 20 grid search repetitions, you could then do sth like: > > > from sklearn.model_selection import StratifiedKFold > > for i in range(n_reps): > k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > gs = GridSearchCV(..., cv=k_fold) > ... > > Best, > Sebastian > > > On Jan 26, 2017, at 5:39 PM, Raga Markely wrote: > > > > Hello, > > > > I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in different splits would be different. > > > > However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated in identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations. 
> > > > If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has random state; I wonder if there is anyway I can make the the training and test sets randomly separated each time I call the GridSearchCV? > > > > Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifiers. > > > > Thank you very much! > > Raga > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Fri Jan 27 10:23:42 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 27 Jan 2017 10:23:42 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> Message-ID: Sounds good, Sebastian.. thanks for the suggestions.. My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far.. 1. Model selection: use nested loop using cross_val_score(GridSearchCV(...),...) same as shown in the scikit-learn page that you provided - the results show no statistically significant difference in accuracy mean +/- SD among classifiers.. this is expected as the pattern is pretty obvious and simple to separate by eyes after dimensionality reduction (I use pipeline of stdscaler, LDA, and classifier)... so i take all of them and use voting classifier in step #3.. 2. Hyperparameter optimization: use GridSearchCV to optimize hyperparameters of each classifiers 3. Decision Region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use voting classifier to get decision region This sounds reasonable? Thank you very much! Raga On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka wrote: > You are welcome! And in addition, if you select among different > algorithms, here are some more suggestions > > a) don?t do it based on your independent test set if this is going to your > final model performance estimate, or be aware that it would be overly > optimistic > b) also, it?s not the best idea to select algorithms using > cross-validation on the same training set that you used for model > selection; a more robust way would be nested CV (e.g,. > http://scikit-learn.org/stable/auto_examples/model_ > selection/plot_nested_cross_validation_iris.html) > > But yeah, it all depends on your dataset and size. If you have a neural > net that takes week to train, and if you have a large dataset anyway so > that you can set aside large sets for testing, I?d train on > train/validation splits and evaluate on the test set. And to compare e.g., > two networks against each other on large test sets, you could do a McNemar > test. > > Best, > Sebastian > > > On Jan 26, 2017, at 8:09 PM, Raga Markely > wrote: > > > > Ahh.. nice.. I will use that.. thanks a lot, Sebastian! 
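A compact sketch of the nested cross_val_score(GridSearchCV(...)) evaluation mentioned in step 1 of the workflow above: the inner GridSearchCV tunes hyperparameters, the outer cross_val_score estimates how well the whole selection procedure generalizes. The pipeline, grid, and dataset are placeholders, not the ones from this thread:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop picks hyperparameters; outer loop scores the selection procedure itself.
gs = GridSearchCV(pipe, param_grid, cv=inner_cv)
scores = cross_val_score(gs, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
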
> > > > Best, > > Raga > > > > On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka > wrote: > > Hi, Raga, > > > > I think that if GridSearchCV is used for classification, the stratified > k-fold doesn?t do shuffling by default. > > > > Say you do 20 grid search repetitions, you could then do sth like: > > > > > > from sklearn.model_selection import StratifiedKFold > > > > for i in range(n_reps): > > k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > > gs = GridSearchCV(..., cv=k_fold) > > ... > > > > Best, > > Sebastian > > > > > On Jan 26, 2017, at 5:39 PM, Raga Markely > wrote: > > > > > > Hello, > > > > > > I was trying to do repeated Grid Search CV (20 repeats). I thought > that each time I call GridSearchCV, the training and test sets separated in > different splits would be different. > > > > > > However, I got the same best_params_ and best_scores_ for all 20 > repeats. It looks like the training and test sets are separated in > identical folds in each run? Just to clarify, e.g. I have the following > data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = > 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I > couldn't get [1,3] [0,2,4] or other combinations. > > > > > > If I understand correctly, GridSearchCV uses StratifiedKFold when I > enter cv = integer. The StratifiedKFold command has random state; I wonder > if there is anyway I can make the the training and test sets randomly > separated each time I call the GridSearchCV? > > > > > > Just a note, I used the following classifiers: Logistic Regression, > KNN, SVC, Kernel SVC, Random Forest, and had the same observation > regardless of the classifiers. > > > > > > Thank you very much! > > > Raga > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Fri Jan 27 12:49:50 2017 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Fri, 27 Jan 2017 12:49:50 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> Message-ID: <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> Hi, Raga, sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. Not saying that this is the optimal/right approach, but I usually do it like this: 1.) algo selection via nested cv 2.) model selection based on best algo via k-fold on whole training set 3.) fit best algo w. best hyperparams (from 2.) to whole training set 4.) evaluate on test set 5.) fit classifier to whole dataset, done Best, Sebastian > On Jan 27, 2017, at 10:23 AM, Raga Markely wrote: > > Sounds good, Sebastian.. thanks for the suggestions.. 
> > My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far.. > 1. Model selection: use nested loop using cross_val_score(GridSearchCV(...),...) same as shown in the scikit-learn page that you provided - the results show no statistically significant difference in accuracy mean +/- SD among classifiers.. this is expected as the pattern is pretty obvious and simple to separate by eyes after dimensionality reduction (I use pipeline of stdscaler, LDA, and classifier)... so i take all of them and use voting classifier in step #3.. > 2. Hyperparameter optimization: use GridSearchCV to optimize hyperparameters of each classifiers > 3. Decision Region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use voting classifier to get decision region > > This sounds reasonable? > > Thank you very much! > Raga > > On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka wrote: > You are welcome! And in addition, if you select among different algorithms, here are some more suggestions > > a) don?t do it based on your independent test set if this is going to your final model performance estimate, or be aware that it would be overly optimistic > b) also, it?s not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g,. http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) > > But yeah, it all depends on your dataset and size. If you have a neural net that takes week to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I?d train on train/validation splits and evaluate on the test set. And to compare e.g., two networks against each other on large test sets, you could do a McNemar test. > > Best, > Sebastian > > > On Jan 26, 2017, at 8:09 PM, Raga Markely wrote: > > > > Ahh.. nice.. I will use that.. thanks a lot, Sebastian! > > > > Best, > > Raga > > > > On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka wrote: > > Hi, Raga, > > > > I think that if GridSearchCV is used for classification, the stratified k-fold doesn?t do shuffling by default. > > > > Say you do 20 grid search repetitions, you could then do sth like: > > > > > > from sklearn.model_selection import StratifiedKFold > > > > for i in range(n_reps): > > k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > > gs = GridSearchCV(..., cv=k_fold) > > ... > > > > Best, > > Sebastian > > > > > On Jan 26, 2017, at 5:39 PM, Raga Markely wrote: > > > > > > Hello, > > > > > > I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in different splits would be different. > > > > > > However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated in identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations. > > > > > > If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has random state; I wonder if there is anyway I can make the the training and test sets randomly separated each time I call the GridSearchCV? 
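For step 3 quoted above (combining the individually tuned classifiers into a voting ensemble fitted on the whole dataset), one possible sketch is below; the classifiers and their "tuned" hyperparameter values are placeholders, not results from this thread.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # placeholder data

def scaled_lda(clf):
    # the preprocessing described above: scaling, then LDA, then the classifier
    return make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(), clf)

vote = VotingClassifier(
    estimators=[
        ('lr', scaled_lda(LogisticRegression(C=1.0))),             # C from step 2 (placeholder)
        ('knn', scaled_lda(KNeighborsClassifier(n_neighbors=5))),  # k from step 2 (placeholder)
        ('svc', scaled_lda(SVC(C=1.0, probability=True))),         # probability=True enables soft voting
    ],
    voting='soft')

vote.fit(X, y)  # fit on the whole dataset before plotting the decision regions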
> > > > > > Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifiers. > > > > > > Thank you very much! > > > Raga > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Fri Jan 27 13:01:26 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 27 Jan 2017 13:01:26 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> Message-ID: <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> Hi, Raga, sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. Not saying that this is the optimal/right approach, but I usually do it like this: 1.) algo selection via nested cv 2.) model selection based on best algo via k-fold on whole training set 3.) fit best algo w. best hyperparams (from 2.) to whole training set 4.) evaluate on test set 5.) fit classifier to whole dataset, done Best, Sebastian > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka wrote: > > Hi, Raga, > > sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. > > Not saying that this is the optimal/right approach, but I usually do it like this: > > 1.) algo selection via nested cv > 2.) model selection based on best algo via k-fold on whole training set > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > 4.) evaluate on test set > 5.) fit classifier to whole dataset, done > > Best, > Sebastian > >> On Jan 27, 2017, at 10:23 AM, Raga Markely wrote: >> >> Sounds good, Sebastian.. thanks for the suggestions.. >> >> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far.. >> 1. Model selection: use nested loop using cross_val_score(GridSearchCV(...),...) same as shown in the scikit-learn page that you provided - the results show no statistically significant difference in accuracy mean +/- SD among classifiers.. this is expected as the pattern is pretty obvious and simple to separate by eyes after dimensionality reduction (I use pipeline of stdscaler, LDA, and classifier)... so i take all of them and use voting classifier in step #3.. >> 2. Hyperparameter optimization: use GridSearchCV to optimize hyperparameters of each classifiers >> 3. 
Decision Region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use voting classifier to get decision region >> >> This sounds reasonable? >> >> Thank you very much! >> Raga >> >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka wrote: >> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions >> >> a) don?t do it based on your independent test set if this is going to your final model performance estimate, or be aware that it would be overly optimistic >> b) also, it?s not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g,. http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) >> >> But yeah, it all depends on your dataset and size. If you have a neural net that takes week to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I?d train on train/validation splits and evaluate on the test set. And to compare e.g., two networks against each other on large test sets, you could do a McNemar test. >> >> Best, >> Sebastian >> >>> On Jan 26, 2017, at 8:09 PM, Raga Markely wrote: >>> >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian! >>> >>> Best, >>> Raga >>> >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka wrote: >>> Hi, Raga, >>> >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn?t do shuffling by default. >>> >>> Say you do 20 grid search repetitions, you could then do sth like: >>> >>> >>> from sklearn.model_selection import StratifiedKFold >>> >>> for i in range(n_reps): >>> k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) >>> gs = GridSearchCV(..., cv=k_fold) >>> ... >>> >>> Best, >>> Sebastian >>> >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely wrote: >>>> >>>> Hello, >>>> >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in different splits would be different. >>>> >>>> However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated in identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations. >>>> >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has random state; I wonder if there is anyway I can make the the training and test sets randomly separated each time I call the GridSearchCV? >>>> >>>> Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifiers. >>>> >>>> Thank you very much! 
>>>> Raga >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Fri Jan 27 13:16:29 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 27 Jan 2017 13:16:29 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> Message-ID: Hi Sebastian, Sorry, I used the wrong terms (I was referring to algo as model).. great then, i think what i have is aligned with your workflow.. Thank you very much for your help! Have a good weekend, Raga On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka wrote: > Hi, Raga, > > sounds good, but I am wondering a bit about the order. 2) should come > before 1), right? Because model selection is basically done via hyperparam > optimization. > > Not saying that this is the optimal/right approach, but I usually do it > like this: > > 1.) algo selection via nested cv > 2.) model selection based on best algo via k-fold on whole training set > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > 4.) evaluate on test set > 5.) fit classifier to whole dataset, done > > Best, > Sebastian > > > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > > Hi, Raga, > > > > sounds good, but I am wondering a bit about the order. 2) should come > before 1), right? Because model selection is basically done via hyperparam > optimization. > > > > Not saying that this is the optimal/right approach, but I usually do it > like this: > > > > 1.) algo selection via nested cv > > 2.) model selection based on best algo via k-fold on whole training set > > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > > 4.) evaluate on test set > > 5.) fit classifier to whole dataset, done > > > > Best, > > Sebastian > > > >> On Jan 27, 2017, at 10:23 AM, Raga Markely > wrote: > >> > >> Sounds good, Sebastian.. thanks for the suggestions.. > >> > >> My dataset is relatively small (only ~35 samples), and this is the > workflow I have set up so far.. > >> 1. Model selection: use nested loop using cross_val_score(GridSearchCV(...),...) > same as shown in the scikit-learn page that you provided - the results show > no statistically significant difference in accuracy mean +/- SD among > classifiers.. 
this is expected as the pattern is pretty obvious and simple > to separate by eyes after dimensionality reduction (I use pipeline of > stdscaler, LDA, and classifier)... so i take all of them and use voting > classifier in step #3.. > >> 2. Hyperparameter optimization: use GridSearchCV to optimize > hyperparameters of each classifiers > >> 3. Decision Region: use the hyperparameters from step #2, fit each > classifier separately to the whole dataset, and use voting classifier to > get decision region > >> > >> This sounds reasonable? > >> > >> Thank you very much! > >> Raga > >> > >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > >> You are welcome! And in addition, if you select among different > algorithms, here are some more suggestions > >> > >> a) don?t do it based on your independent test set if this is going to > your final model performance estimate, or be aware that it would be overly > optimistic > >> b) also, it?s not the best idea to select algorithms using > cross-validation on the same training set that you used for model > selection; a more robust way would be nested CV (e.g,. > http://scikit-learn.org/stable/auto_examples/model_ > selection/plot_nested_cross_validation_iris.html) > >> > >> But yeah, it all depends on your dataset and size. If you have a neural > net that takes week to train, and if you have a large dataset anyway so > that you can set aside large sets for testing, I?d train on > train/validation splits and evaluate on the test set. And to compare e.g., > two networks against each other on large test sets, you could do a McNemar > test. > >> > >> Best, > >> Sebastian > >> > >>> On Jan 26, 2017, at 8:09 PM, Raga Markely > wrote: > >>> > >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian! > >>> > >>> Best, > >>> Raga > >>> > >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > >>> Hi, Raga, > >>> > >>> I think that if GridSearchCV is used for classification, the > stratified k-fold doesn?t do shuffling by default. > >>> > >>> Say you do 20 grid search repetitions, you could then do sth like: > >>> > >>> > >>> from sklearn.model_selection import StratifiedKFold > >>> > >>> for i in range(n_reps): > >>> k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > >>> gs = GridSearchCV(..., cv=k_fold) > >>> ... > >>> > >>> Best, > >>> Sebastian > >>> > >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely > wrote: > >>>> > >>>> Hello, > >>>> > >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought > that each time I call GridSearchCV, the training and test sets separated in > different splits would be different. > >>>> > >>>> However, I got the same best_params_ and best_scores_ for all 20 > repeats. It looks like the training and test sets are separated in > identical folds in each run? Just to clarify, e.g. I have the following > data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = > 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I > couldn't get [1,3] [0,2,4] or other combinations. > >>>> > >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I > enter cv = integer. The StratifiedKFold command has random state; I wonder > if there is anyway I can make the the training and test sets randomly > separated each time I call the GridSearchCV? 
> >>>> > >>>> Just a note, I used the following classifiers: Logistic Regression, > KNN, SVC, Kernel SVC, Random Forest, and had the same observation > regardless of the classifiers. > >>>> > >>>> Thank you very much! > >>>> Raga > >>>> > >>>> _______________________________________________ > >>>> scikit-learn mailing list > >>>> scikit-learn at python.org > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexeyum at yandex.ru Sun Jan 29 11:06:51 2017 From: alexeyum at yandex.ru (=?utf-8?B?0KPQvNC90L7QsiDQkNC70LXQutGB0LXQuSAoQWxleGV5IFVtbm92KQ==?=) Date: Sun, 29 Jan 2017 19:06:51 +0300 Subject: [scikit-learn] K-SVD implementation PR (needs 2nd review) Message-ID: <7195131485706011@web23o.yandex.ru> An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jan 30 08:38:54 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 30 Jan 2017 08:38:54 -0500 Subject: [scikit-learn] GSOC call for mentors In-Reply-To: References: Message-ID: Hey all. It's that time of the year again. Are we planning on participating in GSOC? If so, we need mentors and projects. It's unlikely that I'll have time to help with either in any substantial way. If we want to participate, I think we should try to be a bit more organized than last year ;) Andy Sent from phone. Please excuse spelling and brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Jan 30 13:09:13 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 30 Jan 2017 10:09:13 -0800 Subject: [scikit-learn] GSOC call for mentors In-Reply-To: References: Message-ID: I discussed this briefly with Gael and Joel. The consensus was that unless we already know excellent students who will fit well that it is unlikely we will participate in GSoC. That being said, if someone (other than me) is willing to step up and organize it, I'd volunteer to be a mentor again. I think an important project would be adding multithreading to individual tree building so we can do gradient boosting in parallel. On Mon, Jan 30, 2017 at 5:38 AM, Andreas Mueller wrote: > Hey all. > It's that time of the year again. > Are we planning on participating in GSOC? > If so, we need mentors and projects. > It's unlikely that I'll have time to help with either in any substantial > way. 
> If we want to participate, I think we should try to be a bit more > organized than last year ;) > > Andy > > Sent from phone. Please excuse spelling and brevity. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Mon Jan 30 14:48:32 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 30 Jan 2017 14:48:32 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> Message-ID: Hi Sebastian, Following up on the original question on repeated Grid Search CV, I tried to do repeated nested loop using the followings: N_outer=10 N_inner=10 scores=[] for i in range(N_outer): k_fold_outer = StratifiedKFold(n_splits=10,shuffle=True,random_state=i) for j in range(N_inner): k_fold_inner = StratifiedKFold(n_splits=10,shuffle=True,random_state=j) gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,cv=k_fold_inner) score=cross_val_score(estimator=gs,X=X,y=y,cv=k_fold_outer) scores.append(score) np.mean(scores) np.std(scores) But, I get the following error: TypeError: 'StratifiedKFold' object is not iterable I did some trials, and the error is gone when I remove cv=k_fold_inner from gs = ... Could you give me some tips on what I can do? Thank you! Raga On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely wrote: > Hi Sebastian, > > Sorry, I used the wrong terms (I was referring to algo as model).. great > then, i think what i have is aligned with your workflow.. > > Thank you very much for your help! > > Have a good weekend, > Raga > > On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka > wrote: > >> Hi, Raga, >> >> sounds good, but I am wondering a bit about the order. 2) should come >> before 1), right? Because model selection is basically done via hyperparam >> optimization. >> >> Not saying that this is the optimal/right approach, but I usually do it >> like this: >> >> 1.) algo selection via nested cv >> 2.) model selection based on best algo via k-fold on whole training set >> 3.) fit best algo w. best hyperparams (from 2.) to whole training set >> 4.) evaluate on test set >> 5.) fit classifier to whole dataset, done >> >> Best, >> Sebastian >> >> > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka < >> mail at sebastianraschka.com> wrote: >> > >> > Hi, Raga, >> > >> > sounds good, but I am wondering a bit about the order. 2) should come >> before 1), right? Because model selection is basically done via hyperparam >> optimization. >> > >> > Not saying that this is the optimal/right approach, but I usually do it >> like this: >> > >> > 1.) algo selection via nested cv >> > 2.) model selection based on best algo via k-fold on whole training set >> > 3.) fit best algo w. best hyperparams (from 2.) to whole training set >> > 4.) evaluate on test set >> > 5.) fit classifier to whole dataset, done >> > >> > Best, >> > Sebastian >> > >> >> On Jan 27, 2017, at 10:23 AM, Raga Markely >> wrote: >> >> >> >> Sounds good, Sebastian.. thanks for the suggestions.. >> >> >> >> My dataset is relatively small (only ~35 samples), and this is the >> workflow I have set up so far.. >> >> 1. Model selection: use nested loop using >> cross_val_score(GridSearchCV(...),...) 
same as shown in the scikit-learn >> page that you provided - the results show no statistically significant >> difference in accuracy mean +/- SD among classifiers.. this is expected as >> the pattern is pretty obvious and simple to separate by eyes after >> dimensionality reduction (I use pipeline of stdscaler, LDA, and >> classifier)... so i take all of them and use voting classifier in step #3.. >> >> 2. Hyperparameter optimization: use GridSearchCV to optimize >> hyperparameters of each classifiers >> >> 3. Decision Region: use the hyperparameters from step #2, fit each >> classifier separately to the whole dataset, and use voting classifier to >> get decision region >> >> >> >> This sounds reasonable? >> >> >> >> Thank you very much! >> >> Raga >> >> >> >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka < >> se.raschka at gmail.com> wrote: >> >> You are welcome! And in addition, if you select among different >> algorithms, here are some more suggestions >> >> >> >> a) don?t do it based on your independent test set if this is going to >> your final model performance estimate, or be aware that it would be overly >> optimistic >> >> b) also, it?s not the best idea to select algorithms using >> cross-validation on the same training set that you used for model >> selection; a more robust way would be nested CV (e.g,. >> http://scikit-learn.org/stable/auto_examples/model_selection >> /plot_nested_cross_validation_iris.html) >> >> >> >> But yeah, it all depends on your dataset and size. If you have a >> neural net that takes week to train, and if you have a large dataset anyway >> so that you can set aside large sets for testing, I?d train on >> train/validation splits and evaluate on the test set. And to compare e.g., >> two networks against each other on large test sets, you could do a McNemar >> test. >> >> >> >> Best, >> >> Sebastian >> >> >> >>> On Jan 26, 2017, at 8:09 PM, Raga Markely >> wrote: >> >>> >> >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian! >> >>> >> >>> Best, >> >>> Raga >> >>> >> >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka < >> se.raschka at gmail.com> wrote: >> >>> Hi, Raga, >> >>> >> >>> I think that if GridSearchCV is used for classification, the >> stratified k-fold doesn?t do shuffling by default. >> >>> >> >>> Say you do 20 grid search repetitions, you could then do sth like: >> >>> >> >>> >> >>> from sklearn.model_selection import StratifiedKFold >> >>> >> >>> for i in range(n_reps): >> >>> k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) >> >>> gs = GridSearchCV(..., cv=k_fold) >> >>> ... >> >>> >> >>> Best, >> >>> Sebastian >> >>> >> >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely >> wrote: >> >>>> >> >>>> Hello, >> >>>> >> >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought >> that each time I call GridSearchCV, the training and test sets separated in >> different splits would be different. >> >>>> >> >>>> However, I got the same best_params_ and best_scores_ for all 20 >> repeats. It looks like the training and test sets are separated in >> identical folds in each run? Just to clarify, e.g. I have the following >> data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = >> 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I >> couldn't get [1,3] [0,2,4] or other combinations. >> >>>> >> >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I >> enter cv = integer. 
The StratifiedKFold command has random state; I wonder >> if there is anyway I can make the the training and test sets randomly >> separated each time I call the GridSearchCV? >> >>>> >> >>>> Just a note, I used the following classifiers: Logistic Regression, >> KNN, SVC, Kernel SVC, Random Forest, and had the same observation >> regardless of the classifiers. >> >>>> >> >>>> Thank you very much! >> >>>> Raga >> >>>> >> >>>> _______________________________________________ >> >>>> scikit-learn mailing list >> >>>> scikit-learn at python.org >> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>> >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>> >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Mon Jan 30 15:25:46 2017 From: nfliu at uw.edu (Nelson Liu) Date: Mon, 30 Jan 2017 20:25:46 +0000 Subject: [scikit-learn] GSOC call for mentors In-Reply-To: References: Message-ID: Hey all, I'd be willing to help out with mentoring a project as well, hopefully in tandem with someone else. Nelson Liu On Mon, Jan 30, 2017 at 10:10 AM Jacob Schreiber wrote: > I discussed this briefly with Gael and Joel. The consensus was that unless > we already know excellent students who will fit well that it is unlikely we > will participate in GSoC. That being said, if someone (other than me) is > willing to step up and organize it, I'd volunteer to be a mentor again. I > think an important project would be adding multithreading to individual > tree building so we can do gradient boosting in parallel. > > On Mon, Jan 30, 2017 at 5:38 AM, Andreas Mueller wrote: > > Hey all. > It's that time of the year again. > Are we planning on participating in GSOC? > If so, we need mentors and projects. > It's unlikely that I'll have time to help with either in any substantial > way. > If we want to participate, I think we should try to be a bit more > organized than last year ;) > > Andy > > Sent from phone. Please excuse spelling and brevity. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From se.raschka at gmail.com Mon Jan 30 15:37:57 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 30 Jan 2017 15:37:57 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> Message-ID: Hm, which version of scikit-learn are you using? Are you running this on sklearn 0.18? Best, Sebastian > On Jan 30, 2017, at 2:48 PM, Raga Markely wrote: > > Hi Sebastian, > > Following up on the original question on repeated Grid Search CV, I tried to do repeated nested loop using the followings: > N_outer=10 > N_inner=10 > scores=[] > for i in range(N_outer): > k_fold_outer = StratifiedKFold(n_splits=10,shuffle=True,random_state=i) > for j in range(N_inner): > k_fold_inner = StratifiedKFold(n_splits=10,shuffle=True,random_state=j) > gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,cv=k_fold_inner) > score=cross_val_score(estimator=gs,X=X,y=y,cv=k_fold_outer) > scores.append(score) > np.mean(scores) > np.std(scores) > > But, I get the following error: TypeError: 'StratifiedKFold' object is not iterable > > I did some trials, and the error is gone when I remove cv=k_fold_inner from gs = ... > Could you give me some tips on what I can do? > > Thank you! > Raga > > > > On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely wrote: > Hi Sebastian, > > Sorry, I used the wrong terms (I was referring to algo as model).. great then, i think what i have is aligned with your workflow.. > > Thank you very much for your help! > > Have a good weekend, > Raga > > On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka wrote: > Hi, Raga, > > sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. > > Not saying that this is the optimal/right approach, but I usually do it like this: > > 1.) algo selection via nested cv > 2.) model selection based on best algo via k-fold on whole training set > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > 4.) evaluate on test set > 5.) fit classifier to whole dataset, done > > Best, > Sebastian > > > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka wrote: > > > > Hi, Raga, > > > > sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. > > > > Not saying that this is the optimal/right approach, but I usually do it like this: > > > > 1.) algo selection via nested cv > > 2.) model selection based on best algo via k-fold on whole training set > > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > > 4.) evaluate on test set > > 5.) fit classifier to whole dataset, done > > > > Best, > > Sebastian > > > >> On Jan 27, 2017, at 10:23 AM, Raga Markely wrote: > >> > >> Sounds good, Sebastian.. thanks for the suggestions.. > >> > >> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far.. > >> 1. Model selection: use nested loop using cross_val_score(GridSearchCV(...),...) same as shown in the scikit-learn page that you provided - the results show no statistically significant difference in accuracy mean +/- SD among classifiers.. this is expected as the pattern is pretty obvious and simple to separate by eyes after dimensionality reduction (I use pipeline of stdscaler, LDA, and classifier)... 
so i take all of them and use voting classifier in step #3.. > >> 2. Hyperparameter optimization: use GridSearchCV to optimize hyperparameters of each classifiers > >> 3. Decision Region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use voting classifier to get decision region > >> > >> This sounds reasonable? > >> > >> Thank you very much! > >> Raga > >> > >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka wrote: > >> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions > >> > >> a) don?t do it based on your independent test set if this is going to your final model performance estimate, or be aware that it would be overly optimistic > >> b) also, it?s not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g,. http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) > >> > >> But yeah, it all depends on your dataset and size. If you have a neural net that takes week to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I?d train on train/validation splits and evaluate on the test set. And to compare e.g., two networks against each other on large test sets, you could do a McNemar test. > >> > >> Best, > >> Sebastian > >> > >>> On Jan 26, 2017, at 8:09 PM, Raga Markely wrote: > >>> > >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian! > >>> > >>> Best, > >>> Raga > >>> > >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka wrote: > >>> Hi, Raga, > >>> > >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn?t do shuffling by default. > >>> > >>> Say you do 20 grid search repetitions, you could then do sth like: > >>> > >>> > >>> from sklearn.model_selection import StratifiedKFold > >>> > >>> for i in range(n_reps): > >>> k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > >>> gs = GridSearchCV(..., cv=k_fold) > >>> ... > >>> > >>> Best, > >>> Sebastian > >>> > >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely wrote: > >>>> > >>>> Hello, > >>>> > >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in different splits would be different. > >>>> > >>>> However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated in identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations. > >>>> > >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has random state; I wonder if there is anyway I can make the the training and test sets randomly separated each time I call the GridSearchCV? > >>>> > >>>> Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifiers. > >>>> > >>>> Thank you very much! 
> >>>> Raga > >>>> > >>>> _______________________________________________ > >>>> scikit-learn mailing list > >>>> scikit-learn at python.org > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Mon Jan 30 15:49:26 2017 From: raga.markely at gmail.com (Raga Markely) Date: Mon, 30 Jan 2017 15:49:26 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> Message-ID: Nice catch!! The sklearn was 0.18, but I used sklearn.grid_search instead of sklearn.model_selection. Error is gone now. Thank you, Sebastian! Raga On Mon, Jan 30, 2017 at 3:37 PM, Sebastian Raschka wrote: > Hm, which version of scikit-learn are you using? Are you running this on > sklearn 0.18? > > Best, > Sebastian > > > On Jan 30, 2017, at 2:48 PM, Raga Markely > wrote: > > > > Hi Sebastian, > > > > Following up on the original question on repeated Grid Search CV, I > tried to do repeated nested loop using the followings: > > N_outer=10 > > N_inner=10 > > scores=[] > > for i in range(N_outer): > > k_fold_outer = StratifiedKFold(n_splits=10, > shuffle=True,random_state=i) > > for j in range(N_inner): > > k_fold_inner = StratifiedKFold(n_splits=10, > shuffle=True,random_state=j) > > gs = GridSearchCV(estimator=pipe_svc, > param_grid=param_grid,cv=k_fold_inner) > > score=cross_val_score(estimator=gs,X=X,y=y,cv=k_fold_outer) > > scores.append(score) > > np.mean(scores) > > np.std(scores) > > > > But, I get the following error: TypeError: 'StratifiedKFold' object is > not iterable > > > > I did some trials, and the error is gone when I remove cv=k_fold_inner > from gs = ... > > Could you give me some tips on what I can do? > > > > Thank you! > > Raga > > > > > > > > On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely > wrote: > > Hi Sebastian, > > > > Sorry, I used the wrong terms (I was referring to algo as model).. great > then, i think what i have is aligned with your workflow.. > > > > Thank you very much for your help! 
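For anyone hitting the same TypeError: the deprecated sklearn.grid_search and sklearn.cross_validation modules predate the splitter objects, so handing a new-style StratifiedKFold to the old GridSearchCV or cross_val_score fails with the "not iterable" message. Importing everything from sklearn.model_selection, as in the sketch below, avoids it; pipe_svc, param_grid, and the data are placeholders.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                  # placeholder data
pipe_svc = make_pipeline(StandardScaler(), SVC())  # placeholder pipeline
param_grid = {'svc__C': [0.1, 1.0, 10.0]}          # placeholder grid

N_outer = 10
N_inner = 10
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)

print(np.mean(scores), np.std(scores))  # note: 100 repetitions of nested CV is slow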
> > > > Have a good weekend, > > Raga > > > > On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka > wrote: > > Hi, Raga, > > > > sounds good, but I am wondering a bit about the order. 2) should come > before 1), right? Because model selection is basically done via hyperparam > optimization. > > > > Not saying that this is the optimal/right approach, but I usually do it > like this: > > > > 1.) algo selection via nested cv > > 2.) model selection based on best algo via k-fold on whole training set > > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > > 4.) evaluate on test set > > 5.) fit classifier to whole dataset, done > > > > Best, > > Sebastian > > > > > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > > > > Hi, Raga, > > > > > > sounds good, but I am wondering a bit about the order. 2) should come > before 1), right? Because model selection is basically done via hyperparam > optimization. > > > > > > Not saying that this is the optimal/right approach, but I usually do > it like this: > > > > > > 1.) algo selection via nested cv > > > 2.) model selection based on best algo via k-fold on whole training set > > > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > > > 4.) evaluate on test set > > > 5.) fit classifier to whole dataset, done > > > > > > Best, > > > Sebastian > > > > > >> On Jan 27, 2017, at 10:23 AM, Raga Markely > wrote: > > >> > > >> Sounds good, Sebastian.. thanks for the suggestions.. > > >> > > >> My dataset is relatively small (only ~35 samples), and this is the > workflow I have set up so far.. > > >> 1. Model selection: use nested loop using > cross_val_score(GridSearchCV(...),...) same as shown in the scikit-learn > page that you provided - the results show no statistically significant > difference in accuracy mean +/- SD among classifiers.. this is expected as > the pattern is pretty obvious and simple to separate by eyes after > dimensionality reduction (I use pipeline of stdscaler, LDA, and > classifier)... so i take all of them and use voting classifier in step #3.. > > >> 2. Hyperparameter optimization: use GridSearchCV to optimize > hyperparameters of each classifiers > > >> 3. Decision Region: use the hyperparameters from step #2, fit each > classifier separately to the whole dataset, and use voting classifier to > get decision region > > >> > > >> This sounds reasonable? > > >> > > >> Thank you very much! > > >> Raga > > >> > > >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > >> You are welcome! And in addition, if you select among different > algorithms, here are some more suggestions > > >> > > >> a) don?t do it based on your independent test set if this is going to > your final model performance estimate, or be aware that it would be overly > optimistic > > >> b) also, it?s not the best idea to select algorithms using > cross-validation on the same training set that you used for model > selection; a more robust way would be nested CV (e.g,. > http://scikit-learn.org/stable/auto_examples/model_ > selection/plot_nested_cross_validation_iris.html) > > >> > > >> But yeah, it all depends on your dataset and size. If you have a > neural net that takes week to train, and if you have a large dataset anyway > so that you can set aside large sets for testing, I?d train on > train/validation splits and evaluate on the test set. And to compare e.g., > two networks against each other on large test sets, you could do a McNemar > test. 
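A bare-bones McNemar test for comparing two classifiers on the same held-out test set could be sketched as below (the continuity-corrected chi-square form with one degree of freedom); the labels and predictions are dummy arrays for illustration.

import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    a_ok = (pred_a == y_true)
    b_ok = (pred_b == y_true)
    n01 = np.sum(a_ok & ~b_ok)   # A correct, B wrong
    n10 = np.sum(~a_ok & b_ok)   # A wrong, B correct
    # continuity-corrected McNemar statistic, compared to chi2 with 1 d.f.
    stat = (abs(n01 - n10) - 1.0) ** 2 / (n01 + n10)
    return stat, chi2.sf(stat, 1)

# dummy test-set labels and predictions, for illustration only
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=500)
pred_a = np.where(rng.rand(500) < 0.85, y_true, 1 - y_true)  # roughly 15% error
pred_b = np.where(rng.rand(500) < 0.80, y_true, 1 - y_true)  # roughly 20% error
stat, p = mcnemar_test(y_true, pred_a, pred_b)
print('chi2 = %.3f, p = %.3f' % (stat, p))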
> > >> > > >> Best, > > >> Sebastian > > >> > > >>> On Jan 26, 2017, at 8:09 PM, Raga Markely > wrote: > > >>> > > >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian! > > >>> > > >>> Best, > > >>> Raga > > >>> > > >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > >>> Hi, Raga, > > >>> > > >>> I think that if GridSearchCV is used for classification, the > stratified k-fold doesn?t do shuffling by default. > > >>> > > >>> Say you do 20 grid search repetitions, you could then do sth like: > > >>> > > >>> > > >>> from sklearn.model_selection import StratifiedKFold > > >>> > > >>> for i in range(n_reps): > > >>> k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i) > > >>> gs = GridSearchCV(..., cv=k_fold) > > >>> ... > > >>> > > >>> Best, > > >>> Sebastian > > >>> > > >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely > wrote: > > >>>> > > >>>> Hello, > > >>>> > > >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought > that each time I call GridSearchCV, the training and test sets separated in > different splits would be different. > > >>>> > > >>>> However, I got the same best_params_ and best_scores_ for all 20 > repeats. It looks like the training and test sets are separated in > identical folds in each run? Just to clarify, e.g. I have the following > data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = > 2. The split is always for instance [0,3] [1,2,4] in each repeat, and I > couldn't get [1,3] [0,2,4] or other combinations. > > >>>> > > >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I > enter cv = integer. The StratifiedKFold command has random state; I wonder > if there is anyway I can make the the training and test sets randomly > separated each time I call the GridSearchCV? > > >>>> > > >>>> Just a note, I used the following classifiers: Logistic Regression, > KNN, SVC, Kernel SVC, Random Forest, and had the same observation > regardless of the classifiers. > > >>>> > > >>>> Thank you very much! 
> > >>>> Raga > > >>>> > > >>>> _______________________________________________ > > >>>> scikit-learn mailing list > > >>>> scikit-learn at python.org > > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>> > > >>> _______________________________________________ > > >>> scikit-learn mailing list > > >>> scikit-learn at python.org > > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>> > > >>> _______________________________________________ > > >>> scikit-learn mailing list > > >>> scikit-learn at python.org > > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Jan 30 16:04:49 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 30 Jan 2017 16:04:49 -0500 Subject: [scikit-learn] Random StratifiedKFold Grid Search CV In-Reply-To: References: <91E91A60-8B9C-44E1-85E4-5DB1CB86EBDC@gmail.com> <43515838-969C-495F-8C22-BEB30C04D1DD@sebastianraschka.com> <5EF61074-E96F-4EA8-BA5E-7C4B07505D7B@gmail.com> Message-ID: Cool, glad to hear that it was such an easy fix :) > On Jan 30, 2017, at 3:49 PM, Raga Markely wrote: > > Nice catch!! The sklearn was 0.18, but I used sklearn.grid_search instead of sklearn.model_selection. > > Error is gone now. > > Thank you, Sebastian! > Raga > > On Mon, Jan 30, 2017 at 3:37 PM, Sebastian Raschka wrote: > Hm, which version of scikit-learn are you using? Are you running this on sklearn 0.18? > > Best, > Sebastian > > > On Jan 30, 2017, at 2:48 PM, Raga Markely wrote: > > > > Hi Sebastian, > > > > Following up on the original question on repeated Grid Search CV, I tried to do repeated nested loop using the followings: > > N_outer=10 > > N_inner=10 > > scores=[] > > for i in range(N_outer): > > k_fold_outer = StratifiedKFold(n_splits=10,shuffle=True,random_state=i) > > for j in range(N_inner): > > k_fold_inner = StratifiedKFold(n_splits=10,shuffle=True,random_state=j) > > gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,cv=k_fold_inner) > > score=cross_val_score(estimator=gs,X=X,y=y,cv=k_fold_outer) > > scores.append(score) > > np.mean(scores) > > np.std(scores) > > > > But, I get the following error: TypeError: 'StratifiedKFold' object is not iterable > > > > I did some trials, and the error is gone when I remove cv=k_fold_inner from gs = ... > > Could you give me some tips on what I can do? > > > > Thank you! 
> > Raga > > > > > > > > On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely wrote: > > Hi Sebastian, > > > > Sorry, I used the wrong terms (I was referring to algo as model).. great then, i think what i have is aligned with your workflow.. > > > > Thank you very much for your help! > > > > Have a good weekend, > > Raga > > > > On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka wrote: > > Hi, Raga, > > > > sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. > > > > Not saying that this is the optimal/right approach, but I usually do it like this: > > > > 1.) algo selection via nested cv > > 2.) model selection based on best algo via k-fold on whole training set > > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > > 4.) evaluate on test set > > 5.) fit classifier to whole dataset, done > > > > Best, > > Sebastian > > > > > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka wrote: > > > > > > Hi, Raga, > > > > > > sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization. > > > > > > Not saying that this is the optimal/right approach, but I usually do it like this: > > > > > > 1.) algo selection via nested cv > > > 2.) model selection based on best algo via k-fold on whole training set > > > 3.) fit best algo w. best hyperparams (from 2.) to whole training set > > > 4.) evaluate on test set > > > 5.) fit classifier to whole dataset, done > > > > > > Best, > > > Sebastian > > > > > >> On Jan 27, 2017, at 10:23 AM, Raga Markely wrote: > > >> > > >> Sounds good, Sebastian.. thanks for the suggestions.. > > >> > > >> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far.. > > >> 1. Model selection: use nested loop using cross_val_score(GridSearchCV(...),...) same as shown in the scikit-learn page that you provided - the results show no statistically significant difference in accuracy mean +/- SD among classifiers.. this is expected as the pattern is pretty obvious and simple to separate by eyes after dimensionality reduction (I use pipeline of stdscaler, LDA, and classifier)... so i take all of them and use voting classifier in step #3.. > > >> 2. Hyperparameter optimization: use GridSearchCV to optimize hyperparameters of each classifiers > > >> 3. Decision Region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use voting classifier to get decision region > > >> > > >> This sounds reasonable? > > >> > > >> Thank you very much! > > >> Raga > > >> > > >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka wrote: > > >> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions > > >> > > >> a) don?t do it based on your independent test set if this is going to your final model performance estimate, or be aware that it would be overly optimistic > > >> b) also, it?s not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g,. http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) > > >> > > >> But yeah, it all depends on your dataset and size. 
> > >> Best,
> > >> Sebastian
> > >>
> > >>> On Jan 26, 2017, at 8:09 PM, Raga Markely wrote:
> > >>>
> > >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
> > >>>
> > >>> Best,
> > >>> Raga
> > >>>
> > >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka wrote:
> > >>> Hi, Raga,
> > >>>
> > >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.
> > >>>
> > >>> Say you do 20 grid search repetitions, you could then do sth like:
> > >>>
> > >>> from sklearn.model_selection import StratifiedKFold
> > >>>
> > >>> for i in range(n_reps):
> > >>>     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> > >>>     gs = GridSearchCV(..., cv=k_fold)
> > >>>     ...
> > >>>
> > >>> Best,
> > >>> Sebastian
> > >>>
> > >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely wrote:
> > >>>>
> > >>>> Hello,
> > >>>>
> > >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets would be separated into different splits.
> > >>>>
> > >>>> However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated into identical folds in each run? Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations.
> > >>>>
> > >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has a random state; I wonder if there is any way I can make the training and test sets randomly separated each time I call GridSearchCV?
> > >>>>
> > >>>> Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifier.
> > >>>>
> > >>>> Thank you very much!
> > >>>> Raga
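Expanding that snippet into something runnable, under the assumption of a plain SVC on iris as a stand-in for the classifiers and data in the question: with shuffle=True and a different random_state per repetition, each of the 20 grid searches sees a different stratified partition, so best_score_ (and possibly best_params_) can now vary across repeats instead of being identical.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Illustrative grid; the thread's actual classifiers and grids are not shown.
    param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1, 1.0]}

    n_reps = 20
    best_scores = []
    for i in range(n_reps):
        # shuffle=True with a different random_state per repetition gives a
        # different stratified partition each time; cv=<integer> would reuse
        # the same unshuffled folds in every repetition.
        k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
        gs = GridSearchCV(SVC(), param_grid=param_grid, cv=k_fold)
        gs.fit(X, y)
        best_scores.append(gs.best_score_)

    print(np.mean(best_scores), np.std(best_scores), gs.best_params_)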
> > >>>>
> > >>>> _______________________________________________
> > >>>> scikit-learn mailing list
> > >>>> scikit-learn at python.org
> > >>>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From g.lemaitre58 at gmail.com  Tue Jan 31 12:55:59 2017
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 31 Jan 2017 18:55:59 +0100
Subject: [scikit-learn] GSOC call for mentors
In-Reply-To: 
References: 
Message-ID: 

I would be interested in helping with mentoring, or with whatever else is needed for the project.

On 30 January 2017 at 21:25, Nelson Liu wrote:

> Hey all,
> I'd be willing to help out with mentoring a project as well, hopefully in tandem with someone else.
>
> Nelson Liu
>
> On Mon, Jan 30, 2017 at 10:10 AM Jacob Schreiber wrote:
>
>> I discussed this briefly with Gael and Joel. The consensus was that unless we already know excellent students who will fit well, it is unlikely we will participate in GSoC. That being said, if someone (other than me) is willing to step up and organize it, I'd volunteer to be a mentor again. I think an important project would be adding multithreading to individual tree building so we can do gradient boosting in parallel.
>>
>> On Mon, Jan 30, 2017 at 5:38 AM, Andreas Mueller wrote:
>>
>> Hey all.
>> It's that time of the year again.
>> Are we planning on participating in GSOC?
>> If so, we need mentors and projects.
>> It's unlikely that I'll have time to help with either in any substantial way.
>> If we want to participate, I think we should try to be a bit more organized than last year ;)
>>
>> Andy
>>
>> Sent from phone. Please excuse spelling and brevity.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-- 
Guillaume Lemaitre
INRIA Saclay - Ile-de-France
Equipe PARIETAL
guillaume.lemaitre at inria.fr --- https://glemaitre.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: