[scikit-learn] Problem with nested cross-validation example?

Joel Nothman joel.nothman at gmail.com
Tue Nov 29 06:12:28 EST 2016


Offer whatever patches you think will help.

On 29 November 2016 at 22:01, Daniel Homola <daniel.homola11 at imperial.ac.uk>
wrote:

> Sorry, should've done that.
>
> Thanks for the PR. To me it isn't the actual concept of nested CV that
> needs more detailed explanation, but its implementation in scikit-learn.
>
> I think it's not obvious at all to a newcomer (heck, I've been using it
> for years on and off and even I got confused) that the clf GridSearchCV
> object carries its inner CV object into the cross_val_score function,
> which has its own outer CV object. Unless you know that in scikit-learn
> the CV object of an estimator is *NOT* overridden by the
> cross_val_score function's cv parameter, but rather the two combine into a
> nested CV, you simply cannot work out why this example works. This is the
> confusing bit, I think. Do you want me to add comments that highlight this
> issue?
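>
> A minimal sketch of what I mean (illustrative only; the estimator, the
> grid and the splitters here are my own assumptions, not the example's
> exact setup):
>
>     from sklearn.datasets import load_iris
>     from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
>     from sklearn.svm import SVC
>
>     X, y = load_iris(return_X_y=True)
>
>     inner_cv = KFold(n_splits=4)   # used by GridSearchCV to pick hyperparameters
>     outer_cv = KFold(n_splits=4)   # used by cross_val_score to evaluate the whole procedure
>
>     clf = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10, 100]}, cv=inner_cv)
>
>     # cv=outer_cv does NOT replace clf's inner_cv: clf is refit, running its
>     # own inner CV, on each outer training fold, i.e. nested CV.
>     nested_scores = cross_val_score(clf, X, y, cv=outer_cv)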
>
>
> On 29/11/16 10:48, Joel Nothman wrote:
>
> Wait an hour for the docs to build and you won't get "artifact not found" :)
>
> If you'd looked at the PR diff, you'd see I've modified the description to
> refer directly to GridSearchCV and cross_val_score:
>
> In the inner loop (here executed by GridSearchCV), the score is
>> approximately maximized by fitting a model to each training set, and then
>> directly maximized in selecting (hyper)parameters over the validation set.
>> In the outer loop (here in cross_val_score), ...
>
>
> Further comments in the code are welcome.
>
> On 29 November 2016 at 21:42, Albert Thomas <albertthomas88 at gmail.com>
> wrote:
>
>> I also get "artifact not found". And I agree with Daniel.
>>
>> Once you decompose what the code is doing, you realize that it does the
>> job. The simplicity of the code that performs nested cross-validation using
>> scikit-learn objects is impressive, but I guess it also makes it less
>> obvious. So making the example clearer by explaining what the code does, or
>> by adding a few comments, would be useful for others.
>>
>> Albert
>>
>> On Tue, 29 Nov 2016 at 11:19, Daniel Homola <
>> daniel.homola11 at imperial.ac.uk> wrote:
>>
>>> Hi Joel,
>>>
>>> Thanks a lot for the answer.
>>>
>>> "Each train/test split in cross_val_score holds out test data.
>>> GridSearchCV then splits each train set into (inner-)train and validation
>>> sets. "
>>>
>>> I know this is what nested CV is supposed to do, but the code does an
>>> excellent job of obscuring it. I'll try to add some clarification as
>>> comments later today.
>>>
>>> Cheers,
>>>
>>> d
>>>
>>>
>>> On 29/11/16 00:07, Joel Nothman wrote:
>>>
>>> If that clarifies, please offer changes to the example (as a pull
>>> request) that make this clearer.
>>>
>>> On 29 November 2016 at 11:06, Joel Nothman <joel.nothman at gmail.com>
>>> wrote:
>>>
>>> Briefly:
>>>
>>> clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>> nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>>
>>> (GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
>>> cross_val_score: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
>>>
>>>
>>> Each train/test split in cross_val_score holds out test data.
>>> GridSearchCV then splits each train set into (inner-)train and validation
>>> sets. There is no leakage of test set knowledge from the outer loop into
>>> the grid search optimisation; no leakage of validation set knowledge into
>>> the SVR optimisation. The outer test data are reused as training data in
>>> other outer splits, but within each split they are only used to measure
>>> generalisation error.
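>>>
>>> Spelled out, that one-liner behaves roughly like the explicit double loop
>>> below (just a sketch: scoring and refitting details are simplified, and it
>>> assumes svr, p_grid, inner_cv, outer_cv, X_iris, y_iris as in the snippet
>>> above, with X_iris/y_iris as numpy arrays):
>>>
>>>     outer_scores = []
>>>     for train_idx, test_idx in outer_cv.split(X_iris, y_iris):
>>>         X_tr, y_tr = X_iris[train_idx], y_iris[train_idx]
>>>         X_te, y_te = X_iris[test_idx], y_iris[test_idx]
>>>
>>>         # Inner loop: GridSearchCV splits only the outer training fold, so
>>>         # the hyperparameter search never sees the outer test fold.
>>>         gs = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>>         gs.fit(X_tr, y_tr)
>>>
>>>         # The outer test fold is used only to estimate generalisation error.
>>>         outer_scores.append(gs.score(X_te, y_te))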
>>>
>>> Is that clear?
>>>
>>> On 29 November 2016 at 10:30, Daniel Homola <dani.homola at gmail.com>
>>> wrote:
>>>
>>> Dear all,
>>>
>>>
>>> I was wondering if the following example code is valid:
>>>
>>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>>
>>> My understanding is that the point of nested cross-validation is to
>>> prevent any data leakage from the inner grid-search/parameter-optimization
>>> CV loop into the outer model-evaluation CV loop. This is achieved when the
>>> outer CV loop's test data are kept completely separate from the inner
>>> loop's CV, as shown here:
>>>
>>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>>
>>>
>>> The code in the above example, however, doesn't seem to achieve this in
>>> any way.
>>>
>>>
>>> Am I missing something here?
>>>
>>>
>>> Thanks a lot,
>>>
>>> dh
>>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>