[scikit-learn] Control over the inner loop in GridSearchCV
Sebastian Raschka
se.raschka at gmail.com
Mon Feb 27 11:27:24 EST 2017
Hi, Ludovico,
What format (shape) is your data in? Are these arrays from a KFold iterator? In that case, the "question marks" in your code snippet should simply be the train and validation subset indices generated by the KFold generator, e.g.:
# pre-0.18 cross_validation API
skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)
for outer_train_idx, outer_valid_idx in skfold:
    …
    gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
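For reference, the same outer loop written against the 0.18+ `model_selection` API; the toy data and the small `C` grid below are stand-ins, so treat this as a minimal sketch rather than your actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy stand-ins for X_train / y_train (your data would go here)
X_train, y_train = make_classification(n_samples=100, random_state=1)

# Inner loop: GridSearchCV does its own 3-fold split on each outer training set
gridsearch_object = GridSearchCV(SVC(), param_grid={'C': [0.1, 1.0, 10.0]}, cv=3)

# 0.18+ API: the data is passed to split(), not to the constructor
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores = []
for outer_train_idx, outer_valid_idx in skfold.split(X_train, y_train):
    gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
    scores.append(gridsearch_object.score(X_train[outer_valid_idx],
                                          y_train[outer_valid_idx]))
```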
>
> On the other hand, when we try to pass the nested i-th cv fold as cv argument for clf, and we call fit on the same cv_nested fold, we get an "Index out of bound" error.
> Two questions:
Are you using a version older than scikit-learn 0.18? Technically, GridSearchCV, RandomizedSearchCV, cross_val_score, etc. should all support iterables of (train_indices, test_indices) tuples, e.g.:
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                   X=X_train,
                                   y=y_train,
                                   cv=outer_cv,
                                   n_jobs=1)
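A self-contained sketch of that nested setup (toy data and an illustrative pipeline and grid, not your exact configuration); the explicit list of index tuples at the end shows the "iterable of splits" form of the cv argument, which is how a custom outer schema can be passed in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, GridSearchCV,
                                     cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-ins for X_train / y_train
X_train, y_train = make_classification(n_samples=200, random_state=1)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', SVC(kernel='linear', random_state=42))])

inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: tune C on the inner folds of each outer training set
gs_est = GridSearchCV(pipe, param_grid={'clf__C': [0.1, 1.0, 10.0]},
                      scoring='roc_auc', cv=inner_cv)

# cv also accepts a plain list of (train_idx, test_idx) tuples, so a
# custom outer schema can be precomputed and passed directly:
custom_cv = list(outer_cv.split(X_train, y_train))
nested_scores = cross_val_score(gs_est, X=X_train, y=y_train,
                                cv=custom_cv, n_jobs=1)
```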
Best,
Sebastian
> On Feb 27, 2017, at 9:27 AM, Ludovico Coletta <ludo25_90 at hotmail.com> wrote:
>
> Dear Scikit experts,
>
> we are stuck with GridSearchCV. Nobody else was able or willing to help us; we hope you will.
>
> We are analysing neuroimaging data coming from 3 different MRI scanners, where for each scanner we have a healthy group and a disease group. We would like to merge the data from the 3 different scanners in order to classify the healthy subjects from the ones who have the disease.
>
> The problem is that we can almost perfectly classify the subjects according to the scanner (e.g. the healthy subjects from scanner 1 and scanner 2). We are using a custom cross-validation schema to account for the different scanners: when no hyperparameter (SVM) optimization is performed, everything is straightforward. Problems arise when we would like to perform hyperparameter optimization: in this case we need to balance for the different scanners in the optimization phase as well. We also found a custom cv schema for this, but we are not able to pass it to the GridSearchCV object. We would like to get something like the following:
>
> pipeline = Pipeline([('scl', StandardScaler()),
>                      ('sel', RFE(estimator, step=0.2)),
>                      ('clf', SVC(probability=True, random_state=42))])
>
>
> param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
>                'clf__C': np.logspace(-3, 5, 100),
>                'clf__kernel': ['linear']}]
>
> clf = GridSearchCV(pipeline,
>                    param_grid=param_grid,
>                    verbose=1,
>                    scoring='roc_auc',
>                    n_jobs=-1)
>
> # cv_final is the custom cv for the outer loop (9 folds)
>
> ii = 0
>
> while ii < len(cv_final):
>     # fit and predict
>     clf.fit(data[?], y[?])
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
>     ii = ii + 1
>
> We tried almost everything. When we define clf in the loop, pass the i-th cv_nested as cv argument, and fit it on the training data of the i-th custom_cv fold, we get a "Too many values to unpack" error. On the other hand, when we try to pass the nested i-th cv fold as cv argument for clf, and we call fit on the same cv_nested fold, we get an "Index out of bound" error.
> Two questions:
> 1) Is there any workaround to avoid the split when clf is called without a cv argument?
> 2) We suppose that for hyperparameter optimization the test data is removed from the dataset and a new dataset is created. Is this true? If so, we only need to adjust the indices accordingly.
>
> Thank you for your time, and sorry for the long text.
> Ludovico
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn