[scikit-learn] Problem with nested cross-validation example?

Joel Nothman joel.nothman at gmail.com
Tue Nov 29 04:50:47 EST 2016


This makes me a little sad. Do Albert and Daniel think the explicit
reference from blurb to code proposed at
https://github.com/scikit-learn/scikit-learn/pull/7949 is a sufficient
remedy? Otherwise could you please propose another clarifying change?
Thanks.
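
For readers following the thread, here is a minimal sketch of the nested
pattern discussed in the quoted messages below. It is an illustration rather
than the exact code of the documentation example: the parameter grid, the
fold counts and the random seeds are assumptions chosen for brevity, and an
SVC with an RBF kernel is assumed as the base estimator. The explicit loop at
the end is meant to show what cross_val_score does with the GridSearchCV
object: the grid search only ever sees the outer training indices, and the
outer test indices are used only for scoring.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical parameter grid, chosen only for illustration.
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# Fixed seeds so that both splitters yield the same folds every time.
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Compact form: the grid search is itself the estimator being cross-validated.
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)

# Explicit form of the same outer loop.
manual_scores = []
for train_idx, test_idx in outer_cv.split(X_iris, y_iris):
    search = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid,
                          cv=inner_cv)
    # The inner CV splits only the outer training data into
    # (inner-)train and validation sets.
    search.fit(X_iris[train_idx], y_iris[train_idx])
    # The outer test data are used only to measure generalisation.
    manual_scores.append(search.score(X_iris[test_idx], y_iris[test_idx]))

manual_scores = np.array(manual_scores)
print(nested_scores)
print(manual_scores)
```

With fixed seeds the two score arrays should agree, which is one way to
convince yourself that the one-line cross_val_score call keeps the outer test
folds out of the parameter search.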

On 29 November 2016 at 20:04, Albert Thomas <albertthomas88 at gmail.com>
wrote:

> When I was reading Sebastian's blog posts on Cross Validation a few weeks
> ago I also found the example of Nested cross validation on scikit-learn. At
> first like Daniel I thought the example was not doing what it should be
> doing. But after a few minutes I finally realized that it was correct. So I
> am for a bit more clarification.
>
> Albert
>
> On Tue, 29 Nov 2016 at 02:53, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
>
>> At first glance, the figure at that link and the code example seem
>> to do/show the same thing? Maybe it would be worth adding an explanatory
>> figure like this to the docs to clarify?
>>
>> > On Nov 28, 2016, at 7:07 PM, Joel Nothman <joel.nothman at gmail.com>
>> wrote:
>> >
>> > If that clarifies, please offer changes to the example (as a pull
>> request) that make this clearer.
>> >
>> > On 29 November 2016 at 11:06, Joel Nothman <joel.nothman at gmail.com>
>> wrote:
>> > Briefly:
>> >
>> > clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>> > nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>> >
>> > Each train/test split in cross_val_score holds out test data.
>> > GridSearchCV then splits each train set into (inner-)train and validation
>> > sets. There is no leakage of test set knowledge from the outer loop into
>> > the grid search optimisation; no leakage of validation set knowledge into
>> > the SVR optimisation. The outer test data are reused as training data in
>> > other outer splits, but within each split they are used only to measure
>> > generalisation error.
>> >
>> > Is that clear?
>> >
>> > On 29 November 2016 at 10:30, Daniel Homola <dani.homola at gmail.com>
>> wrote:
>> > Dear all,
>> >
>> > I was wondering if the following example code is valid:
>> > http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>> >
>> > My understanding is that the point of nested cross-validation is to
>> > prevent any data leakage from the inner grid-search/param optimization CV
>> > loop into the outer model evaluation CV loop. This could be achieved if the
>> > outer CV loop's test data is completely separated from the inner loop's CV,
>> > as shown here:
>> > https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>> >
>> > The code in the above example, however, doesn't seem to achieve this in
>> > any way.
>> >
>> > Am I missing something here?
>> >
>> > Thanks a lot,
>> > dh
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn