[scikit-learn] Problem with nested cross-validation example?

Daniel Homola daniel.homola11 at imperial.ac.uk
Tue Nov 29 04:51:59 EST 2016


Hi Joel,

Thanks a lot for the answer.

"Each train/test split in cross_val_score holds out test data.
GridSearchCV then splits each train set into (inner-)train and
validation sets."

I know this is what nested CV is supposed to do, but the code does an
excellent job of obscuring it. I'll try to add some clarifying comments
later today.
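
A rough sketch of what that clarification might look like, with the outer
loop written out by hand so the two levels of splitting are visible. The
grid values, the fold counts, the random seeds and the use of SVC are
assumptions made for illustration; the names clf, svr, p_grid, inner_cv,
outer_cv, X_iris and y_iris follow the snippet quoted below.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Assumed settings for illustration only.
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
svr = SVC(kernel="rbf")  # variable name kept from the quoted snippet
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris, y_iris):
    # Outer split: the test fold is set aside and never seen by the search.
    X_train, y_train = X_iris[train_idx], y_iris[train_idx]
    X_test, y_test = X_iris[test_idx], y_iris[test_idx]

    # Inner loop: GridSearchCV splits only the outer training fold into
    # (inner-)train and validation sets to choose the hyperparameters,
    # then refits the best setting on the whole outer training fold.
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
    clf.fit(X_train, y_train)

    # The held-out outer test fold is used only to measure generalisation.
    outer_scores.append(clf.score(X_test, y_test))

nested_score = np.array(outer_scores)

# The compact two-line form from the example, for comparison.
reference = cross_val_score(
    GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv),
    X=X_iris, y=y_iris, cv=outer_cv,
)
```

With fixed seeds, the hand-written loop and the cross_val_score call should
produce the same per-fold scores, which is the point: the compact form
already does the nesting, it just hides it.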

Cheers,

d


On 29/11/16 00:07, Joel Nothman wrote:
> If that clarifies, please offer changes to the example (as a pull 
> request) that make this clearer.
>
> On 29 November 2016 at 11:06, Joel Nothman <joel.nothman at gmail.com> wrote:
>
>     Briefly:
>
>     clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>     nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>
>
>     Each train/test split in cross_val_score holds out test data.
>     GridSearchCV then splits each train set into (inner-)train and
>     validation sets. There is no leakage of test set knowledge from
>     the outer loop into the grid search optimisation; no leakage of
>     validation set knowledge into the SVR optimisation. The outer test
>     data are reused as training data, but within each split are only
>     used to measure generalisation error.
>
>     Is that clear?
>
>     On 29 November 2016 at 10:30, Daniel Homola <dani.homola at gmail.com> wrote:
>
>         Dear all,
>
>
>         I was wondering if the following example code is valid:
>
>         http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>
>         My understanding is that the point of nested cross-validation
>         is to prevent any data leakage from the inner grid-search /
>         parameter-optimisation CV loop into the outer model-evaluation
>         CV loop. This can be achieved if the outer CV loop's test data
>         is kept completely separate from the inner CV loop, as shown
>         here:
>
>         https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>
>
>         The code in the above example however doesn't seem to achieve
>         this in any way.
>
>
>         Am I missing something here?
>
>
>         Thanks a lot,
>
>         dh
>
>
>         _______________________________________________
>         scikit-learn mailing list
>         scikit-learn at python.org
>         https://mail.python.org/mailman/listinfo/scikit-learn
>
