Random StratifiedKFold Grid Search CV
Hello,

I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated into the different splits would be different.

However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated into identical folds in each run. To clarify with an example: I have the data 0, 1, 2, 3, 4, with Class 1 = [0, 1, 2] and Class 2 = [3, 4]. Suppose I call cv=2. The split is always, for instance, [0, 3] [1, 2, 4] in each repeat, and I couldn't get [1, 3] [0, 2, 4] or other combinations.

If I understand correctly, GridSearchCV uses StratifiedKFold when I pass cv as an integer. StratifiedKFold has a random_state parameter; I wonder if there is any way I can make the training and test sets randomly separated each time I call GridSearchCV?

Just a note: I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, and Random Forest, and had the same observation regardless of the classifier.

Thank you very much!
Raga
Hi, Raga,

I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.

Say you do 20 grid search repetitions; you could then do something like:

from sklearn.model_selection import StratifiedKFold

for i in range(n_reps):
    k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    gs = GridSearchCV(..., cv=k_fold)
    ...

Best,
Sebastian
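A minimal, self-contained sketch of the loop above, with the elided GridSearchCV arguments filled in by a hypothetical estimator and parameter grid (the iris data, LogisticRegression, and grid below are illustrative assumptions, not the setup from the original posts):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# Hypothetical estimator and grid; substitute your own pipeline and parameters.
estimator = LogisticRegression(solver='liblinear')
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

n_reps = 20
for i in range(n_reps):
    # A different random_state per repetition gives differently shuffled folds.
    k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    gs = GridSearchCV(estimator, param_grid, cv=k_fold)
    gs.fit(X, y)
    print(i, gs.best_params_, round(gs.best_score_, 3))

Because each repetition now uses a different shuffled split, best_params_ and best_score_ can vary from run to run.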
Ahh.. nice.. I will use that.. thanks a lot, Sebastian!

Best,
Raga
You are welcome! And in addition, if you select among different algorithms, here are some more suggestions:

a) Don't do it based on your independent test set if that set is going to give your final model performance estimate, or be aware that the estimate would be overly optimistic.

b) Also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html).

But yeah, it all depends on your dataset and size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare, e.g., two networks against each other on large test sets, you could do a McNemar test.

Best,
Sebastian
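For reference, a sketch of the nested CV suggestion in b), along the lines of the linked scikit-learn example; the SVC estimator, grid, and iris data are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid for an RBF-kernel SVC.
param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.1]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The inner loop tunes hyperparameters; the outer loop estimates how well
# the whole tuning procedure generalizes.
gs = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(gs, X=X, y=y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())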
Sounds good, Sebastian.. thanks for the suggestions..

My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far:

1. Model selection: use a nested loop via cross_val_score(GridSearchCV(...), ...), the same as shown in the scikit-learn page that you provided. The results show no statistically significant difference in accuracy mean +/- SD among the classifiers; this is expected, as the pattern is pretty obvious and simple to separate by eye after dimensionality reduction (I use a pipeline of StandardScaler, LDA, and classifier), so I take all of them and use a voting classifier in step #3.
2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier.
3. Decision region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision region (see the sketch below).

Does this sound reasonable?

Thank you very much!
Raga
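A rough sketch of step #3 as described above, assuming hypothetical tuned hyperparameters in place of the GridSearchCV results from step #2; the pipelines mirror the StandardScaler, LDA, classifier setup mentioned in step #1:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def make_pipe(clf):
    # Each classifier gets its own scaler + LDA, as in the described pipeline.
    return Pipeline([('scaler', StandardScaler()),
                     ('lda', LinearDiscriminantAnalysis()),
                     ('clf', clf)])

# Hyperparameter values here are placeholders for the step-#2 results.
voting_clf = VotingClassifier(
    estimators=[('lr', make_pipe(LogisticRegression(C=1.0))),
                ('knn', make_pipe(KNeighborsClassifier(n_neighbors=5))),
                ('svc', make_pipe(SVC(C=10.0, kernel='rbf', probability=True)))],
    voting='soft')

# voting_clf.fit(X, y) on the whole dataset would then give the decision region.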
Hi, Raga,

sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization.

Not saying that this is the optimal/right approach, but I usually do it like this:

1.) algo selection via nested cv
2.) model selection based on best algo via k-fold on whole training set
3.) fit best algo w. best hyperparams (from 2.) to whole training set
4.) evaluate on test set
5.) fit classifier to whole dataset, done

Best,
Sebastian
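If it helps, a condensed sketch of steps 2.) through 5.), assuming step 1.) has already selected this algorithm; the dataset, grid, and split sizes are illustrative placeholders:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.1]}  # hypothetical grid

# 2.) hyperparameter search on the whole training set
gs = GridSearchCV(SVC(), param_grid,
                  cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
gs.fit(X_train, y_train)

# 3.) + 4.) the best model (refit on the training set by default) is
# evaluated once on the held-out test set
test_acc = gs.best_estimator_.score(X_test, y_test)

# 5.) final model: fit to the whole dataset with the selected hyperparameters
final_model = SVC(**gs.best_params_).fit(X, y)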
Hi Sebastian,

Sorry, I used the wrong terms (I was referring to algo as model).. great then, I think what I have is aligned with your workflow..

Thank you very much for your help!

Have a good weekend,
Raga
Hi Sebastian,

Following up on the original question on repeated Grid Search CV, I tried to do a repeated nested loop using the following:

N_outer = 10
N_inner = 10
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)
np.mean(scores)
np.std(scores)

But I get the following error: TypeError: 'StratifiedKFold' object is not iterable

I did some trials, and the error is gone when I remove cv=k_fold_inner from gs = ...
Could you give me some tips on what I can do?

Thank you!
Raga
Hm, which version of scikit-learn are you using? Are you running this on sklearn 0.18? Best, Sebastian
Nice catch!! The sklearn version was 0.18, but I used sklearn.grid_search instead of sklearn.model_selection. The error is gone now.

Thank you, Sebastian!
Raga
Cool, glad to hear that it was such an easy fix :)
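For anyone hitting the same TypeError: a sketch of the working version, assuming both GridSearchCV and StratifiedKFold are imported from sklearn.model_selection rather than the older sklearn.grid_search module; pipe_svc and param_grid are hypothetical stand-ins for the ones in the earlier post:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical pipeline and grid in place of the original pipe_svc / param_grid.
pipe_svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
param_grid = {'svc__C': [1, 10, 100], 'svc__gamma': [0.01, 0.1]}

N_outer = 10
N_inner = 10
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        # The model_selection version of GridSearchCV accepts the new-style
        # CV splitter objects; the legacy grid_search version does not.
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                          cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)

print(np.mean(scores), np.std(scores))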
participants (3)
- Raga Markely
- Sebastian Raschka
- Sebastian Raschka