[scikit-learn] Random StratifiedKFold Grid Search CV

Raga Markely raga.markely at gmail.com
Fri Jan 27 10:23:42 EST 2017


Sounds good, Sebastian - thanks for the suggestions.

My dataset is relatively small (only ~35 samples), and this is the workflow
I have set up so far:
1. Model selection: run nested CV using
cross_val_score(GridSearchCV(...), ...), as shown on the scikit-learn page
you provided (see the sketch after this list). The results show no
statistically significant difference in mean accuracy +/- SD among the
classifiers. This is expected, as the pattern is pretty obvious and easy to
separate by eye after dimensionality reduction (I use a pipeline of
StandardScaler, LDA, and a classifier), so I take all of them and use a
voting classifier in step #3.
2. Hyperparameter optimization: use GridSearchCV to optimize the
hyperparameters of each classifier.
3. Decision region: use the hyperparameters from step #2, fit each
classifier separately to the whole dataset, and use a voting classifier to
get the decision region.
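
In code, the three steps look roughly like this (the toy data, the two
candidate classifiers, and the parameter grids are placeholders I made up
just to show the structure - my actual pipelines differ):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     StratifiedKFold)

# Toy stand-in for my ~35-sample dataset
X, y = make_classification(n_samples=35, n_features=10, n_informative=4,
                           random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Each candidate: StandardScaler -> LDA -> classifier, plus a placeholder grid
candidates = {
    'svc': (make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(),
                          SVC()),
            {'svc__C': [0.1, 1.0, 10.0]}),
    'lr': (make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(),
                         LogisticRegression()),
           {'logisticregression__C': [0.1, 1.0, 10.0]}),
}

tuned = []
for name, (pipe, grid) in candidates.items():
    gs = GridSearchCV(pipe, grid, cv=inner_cv)
    # Step 1: nested CV -> accuracy estimate per algorithm
    scores = cross_val_score(gs, X, y, cv=outer_cv)
    print(name, scores.mean(), '+/-', scores.std())
    # Step 2: refit the grid search on the whole dataset to pick
    # each classifier's hyperparameters
    gs.fit(X, y)
    tuned.append((name, gs.best_estimator_))

# Step 3: majority vote over the tuned pipelines, fit on the whole dataset;
# the fitted ensemble is what I use to plot the decision region
voter = VotingClassifier(estimators=tuned, voting='hard')
voter.fit(X, y)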

Does this sound reasonable?

Thank you very much!
Raga

On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:

> You are welcome! And in addition, if you select among different
> algorithms, here are some more suggestions
>
> a) don’t do it based on your independent test set if this is going to be
> your final model performance estimate, or be aware that it would be overly
> optimistic
> b) also, it’s not the best idea to select algorithms using
> cross-validation on the same training set that you used for model
> selection; a more robust way would be nested CV (e.g.,
> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
>
> But yeah, it all depends on your dataset and size. If you have a neural
> net that takes weeks to train, and if you have a large dataset anyway so
> that you can set aside large sets for testing, I’d train on
> train/validation splits and evaluate on the test set. And to compare e.g.,
> two networks against each other on large test sets, you could do a McNemar
> test.
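>
> For instance, a rough sketch (the mcnemar_test helper is just something
> I'm writing here, not an sklearn function; it assumes the two models
> disagree on at least one sample):
>
> import numpy as np
> from scipy.stats import chi2
>
> def mcnemar_test(y_true, pred_a, pred_b):
>     # Predictions of both models on the same held-out test set
>     a_ok = pred_a == y_true
>     b_ok = pred_b == y_true
>     b = np.sum(a_ok & ~b_ok)  # A correct, B wrong
>     c = np.sum(~a_ok & b_ok)  # A wrong, B correct
>     # chi-squared statistic with continuity correction; needs b + c > 0
>     stat = (abs(b - c) - 1) ** 2 / (b + c)
>     return stat, chi2.sf(stat, df=1)  # statistic, p-value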
>
> Best,
> Sebastian
>
> > On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.markely at gmail.com> wrote:
> >
> > Ahh, nice - I will use that. Thanks a lot, Sebastian!
> >
> > Best,
> > Raga
> >
> > On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > Hi, Raga,
> >
> > I think that if GridSearchCV is used for classification, the stratified
> > k-fold doesn’t do shuffling by default.
> >
> > Say you do 20 grid search repetitions; you could then do something like:
> >
> >
> > from sklearn.model_selection import StratifiedKFold
> >
> > for i in range(n_reps):
> >     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> >     gs = GridSearchCV(..., cv=k_fold)
> >     ...
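> >
> > Filled in, that could look like this (SVC, the grid, and the iris data
> > are just placeholders for your estimator and data):
> >
> > from sklearn.datasets import load_iris
> > from sklearn.model_selection import GridSearchCV, StratifiedKFold
> > from sklearn.svm import SVC
> >
> > X, y = load_iris(return_X_y=True)      # stand-in for your data
> > param_grid = {'C': [0.1, 1.0, 10.0]}   # placeholder grid
> >
> > for i in range(20):
> >     # a different random_state per repetition -> freshly shuffled folds
> >     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> >     gs = GridSearchCV(SVC(), param_grid, cv=k_fold)
> >     gs.fit(X, y)
> >     print(gs.best_params_, gs.best_score_)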
> >
> > Best,
> > Sebastian
> >
> > > On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.markely at gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I was trying to do repeated Grid Search CV (20 repeats). I thought
> > > that each time I call GridSearchCV, the data would be separated into
> > > different training and test splits.
> > >
> > > However, I got the same best_params_ and best_score_ for all 20
> > > repeats. It looks like the training and test sets are separated into
> > > identical folds in each run. To clarify with an example: say I have the
> > > data 0,1,2,3,4, with Class 1 = [0,1,2] and Class 2 = [3,4], and I call
> > > cv = 2. The split is always, for instance, [0,3] [1,2,4] in every
> > > repeat; I never get [1,3] [0,2,4] or other combinations.
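> > >
> > > A quick toy check (same toy labels as above) reproduces this - without
> > > shuffling, the printed test folds are identical on every call:
> > >
> > > import numpy as np
> > > from sklearn.model_selection import StratifiedKFold
> > >
> > > X = np.zeros((5, 1))
> > > y = np.array([1, 1, 1, 2, 2])
> > > for _ in range(3):
> > >     # shuffle=False is the default -> deterministic splits
> > >     folds = [list(test)
> > >              for _, test in StratifiedKFold(n_splits=2).split(X, y)]
> > >     print(folds)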
> > >
> > > If I understand correctly, GridSearchCV uses StratifiedKFold when I
> > > pass cv = integer. StratifiedKFold has a random_state parameter; I
> > > wonder if there is any way I can make the training and test sets
> > > randomly separated each time I call GridSearchCV?
> > >
> > > Just a note: I used the following classifiers - Logistic Regression,
> > > KNN, SVC, Kernel SVC, and Random Forest - and had the same observation
> > > regardless of the classifier.
> > >
> > > Thank you very much!
> > > Raga
> > >