[scikit-learn] Problem with nested cross-validation example?

Sebastian Raschka se.raschka at gmail.com
Tue Nov 29 09:10:56 EST 2016


I have an ipynb where I did the nested CV more "manually" in sklearn 0.17 vs. sklearn 0.18; I intended to add it as an appendix to a blog article (model eval part 4), which I haven't had a chance to write yet. Maybe the sklearn 0.17 part is a bit more obvious (although far less elegant) than the sklearn 0.18 version and helps show what's going on: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb (I haven't had a chance to add comments yet, though).
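For anyone who wants the gist without opening the notebook, here is a minimal sketch of what the "manual" version boils down to (hypothetical illustration, not copied from the notebook; it assumes the 0.18 model_selection API, an SVC classifier, and a toy C grid):

```python
# Manual nested CV sketch: the inner loop (GridSearchCV) tunes
# hyperparameters on each outer training fold only; the outer loop
# scores the tuned model on the held-out outer test fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100]}  # illustrative grid, not from the notebook

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Inner CV sees only the outer training fold.
    inner_cv = KFold(n_splits=2, shuffle=True, random_state=0)
    gs = GridSearchCV(SVC(), p_grid, cv=inner_cv)
    gs.fit(X[train_idx], y[train_idx])
    # Generalization estimate on data the inner loop never touched.
    outer_scores.append(gs.score(X[test_idx], y[test_idx]))

print(sum(outer_scores) / len(outer_scores))
```

The equivalent one-liner in 0.18 is `cross_val_score(GridSearchCV(...), X, y, cv=outer_cv)`; the loop above just makes the two nesting levels explicit.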

Btw., does anyone have a good (research-article) reference for nested CV?

I see people often referring to Dietterich [1], who mentions 5x2 CV. However, I think his 5x2 CV approach is different from the "nested cross-validation" that is commonly used, since the 5x2 example is just 2-fold CV repeated 5 times (10 estimates). Maybe Varma & Simon [2] would be a better reference? However, they seem to hold out only 1 test sample in the outer fold. Does anyone know of a nice empirical study on nested CV (something like Ron Kohavi's study of k-fold CV)?

[1] Dietterich, Thomas G. 1998. "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation 10 (7): 1895–1923. doi:10.1162/089976698300017197.

[2] Varma, Sudhir, and Richard Simon. 2006. “Bias in Error Estimation When Using Cross-Validation for Model Selection.” BMC Bioinformatics 7: 91. doi:10.1186/1471-2105-7-91.

> On Nov 29, 2016, at 6:12 AM, Joel Nothman <joel.nothman at gmail.com> wrote:
> 
> Offer whatever patches you think will help.
> 
> On 29 November 2016 at 22:01, Daniel Homola <daniel.homola11 at imperial.ac.uk> wrote:
> Sorry, should've done that. 
> Thanks for the PR. To me it isn't the actual concept of nested CV that needs more detailed explanation but the implementation in scikit-learn. 
> I think it's not obvious at all for a newcomer (heck, I've been using it for years on and off and even I got confused) that the clf GridSearch object will carry it's inner CV object into the cross_val_score function, which has it's own outer CV object. Unless you know that in scikit-learn the CV object of an estimator is NOT overloaded with the cross_val_score function's cv parameter, but rather it will result in a nested CV, you simply cannot work out why this example works.. This is the confusing bit I think.. Do you want me to add comments that highlight this issue?
> 
> 
> On 29/11/16 10:48, Joel Nothman wrote:
>> Wait an hour for the docs to build and you won't get "artifact not found" :)
>> 
>> If you'd looked at the PR diff, you'd see I've modified the description to refer directly to GridSearchCV and cross_val_score:
>> 
>> In the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop (here in cross_val_score), ...
>> 
>> Further comments in the code are welcome.
>> 
>> On 29 November 2016 at 21:42, Albert Thomas <albertthomas88 at gmail.com> wrote:
>> I also get "artifact not found". And I agree with Daniel.
>> 
>> Once you decompose what the code is doing, you realize that it does the job. The simplicity of the code to perform nested cross-validation using scikit-learn objects is impressive, but I guess it also makes it less obvious. So making the example clearer by explaining what the code does, or by adding a few comments, could be useful for others.
>> 
>> Albert 
>> 
>> On Tue, 29 Nov 2016 at 11:19, Daniel Homola <daniel.homola11 at imperial.ac.uk> wrote:
>> Hi Joel,
>> 
>> Thanks a lot for the answer.
>> "Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. "
>> 
>> I know this is what nested CV is supposed to do, but the code is doing an excellent job of obscuring this. I'll try to add some clarification as comments later today.
>> 
>> Cheers,
>> 
>> d
>> 
>> On 29/11/16 00:07, Joel Nothman wrote:
>>> If that clarifies, please offer changes to the example (as a pull request) that make this clearer.
>>> 
>>> On 29 November 2016 at 11:06, Joel Nothman <joel.nothman at gmail.com> wrote:
>>> Briefly:
>>> 
>>> clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>> nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>> 
>>> Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. There is no leakage of test set knowledge from the outer loop into the grid search optimisation; no leakage of validation set knowledge into the SVR optimisation. The outer test data are reused as training data, but within each split are only used to measure generalisation error.
>>> 
>>> Is that clear?
>>> 
>>> On 29 November 2016 at 10:30, Daniel Homola <dani.homola at gmail.com> wrote:
>>> Dear all,
>>> 
>>> I was wondering if the following example code is valid:
>>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>> 
>>> My understanding is that the point of nested cross-validation is to prevent any data leakage from the inner grid-search/param-optimization CV loop into the outer model-evaluation CV loop. This could be achieved if the outer CV loop's test data is kept completely separate from the inner loop's CV, as shown here:
>>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>> 
>>> The code in the above example however doesn't seem to achieve this in any way.
>>> 
>>> Am I missing something here? 
>>> 
>>> Thanks a lot,
>>> dh
>>> 
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>> 
> 


