[scikit-learn] Random StratifiedKFold Grid Search CV

Sebastian Raschka mail at sebastianraschka.com
Fri Jan 27 12:49:50 EST 2017


Hi, Raga,

Sounds good, but I am wondering a bit about the order: shouldn't 2) come before 1)? Model selection is basically done via hyperparameter optimization.

Not saying that this is the optimal/right approach, but I usually do it like this:

1.) algorithm selection via nested CV
2.) model selection (hyperparameter tuning) for the best algorithm via k-fold CV on the whole training set
3.) fit the best algorithm with the best hyperparameters (from 2.) to the whole training set
4.) evaluate on the test set
5.) fit the classifier to the whole dataset, done

Best,
Sebastian

> On Jan 27, 2017, at 10:23 AM, Raga Markely <raga.markely at gmail.com> wrote:
> 
> Sounds good, Sebastian. Thanks for the suggestions!
> 
> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far:
> 1. Model selection: use a nested loop via cross_val_score(GridSearchCV(...), ...), the same as shown on the scikit-learn page you provided. The results show no statistically significant difference in mean accuracy +/- SD among the classifiers. This is expected, since the pattern is pretty obvious and easy to separate by eye after dimensionality reduction (I use a pipeline of StandardScaler, LDA, and a classifier), so I take all of them and use a voting classifier in step #3.
> 2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier.
> 3. Decision regions: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision regions (rough sketch below).
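> 
> For step #3, roughly what I have in mind is the sketch below (the two pipelines, the classifier choices, and the hyperparameter values are just placeholders for my actual classifiers and the values found in step #2):
> 
> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
> from sklearn.ensemble import VotingClassifier
> from sklearn.linear_model import LogisticRegression
> from sklearn.neighbors import KNeighborsClassifier
> from sklearn.pipeline import Pipeline
> from sklearn.preprocessing import StandardScaler
> 
> # pipelines of StandardScaler, LDA, and classifier; the hyperparameters
> # below stand in for the values found via GridSearchCV in step #2
> pipe_lr = Pipeline([('scale', StandardScaler()),
>                     ('lda', LinearDiscriminantAnalysis()),
>                     ('clf', LogisticRegression(C=1.0))])
> pipe_knn = Pipeline([('scale', StandardScaler()),
>                      ('lda', LinearDiscriminantAnalysis()),
>                      ('clf', KNeighborsClassifier(n_neighbors=5))])
> 
> voting = VotingClassifier(estimators=[('lr', pipe_lr), ('knn', pipe_knn)],
>                           voting='hard')
> # voting.fit(X, y)  # fit on the whole dataset, then plot decision regions
> #                   # from voting.predict(...) evaluated on a mesh grid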
> 
> Does this sound reasonable?
> 
> Thank you very much!
> Raga
> 
> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions:
> 
> a) don't do it based on your independent test set if this is going to be your final model performance estimate, or be aware that the estimate would be overly optimistic
> b) also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
> 
> But yeah, it all depends on your dataset and size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare, e.g., two networks against each other on a large test set, you could do a McNemar test.
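> 
> For instance, a bare-bones sketch of such a McNemar comparison (y_true, y_pred_a, and y_pred_b below are assumed to be the test labels and the predictions of the two models; this uses the chi-square approximation with continuity correction):
> 
> import numpy as np
> from scipy.stats import chi2
> 
> def mcnemar_test(y_true, y_pred_a, y_pred_b):
>     a_ok = (y_pred_a == y_true)
>     b_ok = (y_pred_b == y_true)
>     b = np.sum(a_ok & ~b_ok)   # samples only model A classified correctly
>     c = np.sum(~a_ok & b_ok)   # samples only model B classified correctly
>     # chi-square statistic with continuity correction; assumes at least
>     # a few discordant pairs, i.e. b + c > 0
>     stat = (abs(b - c) - 1.0) ** 2 / float(b + c)
>     return stat, chi2.sf(stat, df=1)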
> 
> Best,
> Sebastian
> 
> > On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.markely at gmail.com> wrote:
> >
> > Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
> >
> > Best,
> > Raga
> >
> > On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > Hi, Raga,
> >
> > I think the issue is that when GridSearchCV is used for classification with an integer cv, it uses StratifiedKFold, which doesn't shuffle by default.
> >
> > Say you do 20 grid search repetitions; you could then do something like:
> >
> >
> > from sklearn.model_selection import GridSearchCV, StratifiedKFold
> >
> > n_reps = 20
> > for i in range(n_reps):
> >     # a new random_state in each repetition -> different stratified splits
> >     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> >     gs = GridSearchCV(..., cv=k_fold)
> >     ...
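> >
> > and, e.g., collect the results to see how much they vary over the repetitions (estimator and param_grid below are placeholders for your pipeline and its hyperparameter grid, X/y for your data):
> >
> > import numpy as np
> >
> > best_scores, best_params = [], []
> > for i in range(n_reps):
> >     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> >     gs = GridSearchCV(estimator, param_grid, cv=k_fold)
> >     gs.fit(X, y)
> >     best_scores.append(gs.best_score_)
> >     best_params.append(gs.best_params_)
> >
> > print('best score: %.3f +/- %.3f' % (np.mean(best_scores), np.std(best_scores)))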
> >
> > Best,
> > Sebastian
> >
> > > On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.markely at gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I was trying to do a repeated grid search CV (20 repeats). I thought that each time I call GridSearchCV, the data would be split into different training and test folds.
> > >
> > > However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the data is split into identical folds in each run. To clarify with an example: say I have the samples 0,1,2,3,4, with class 1 = [0,1,2] and class 2 = [3,4], and I call cv = 2. The split is always, for instance, [0,3] [1,2,4] in every repeat; I never get [1,3] [0,2,4] or any other combination.
> > >
> > > If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. StratifiedKFold has a random_state parameter; is there any way I can make the training and test sets be split randomly each time I call GridSearchCV?
> > >
> > > Just a note: I used the following classifiers (Logistic Regression, KNN, SVC, kernel SVC, Random Forest) and had the same observation regardless of the classifier.
> > >
> > > Thank you very much!
> > > Raga
> > >
