[scikit-learn] Control over the inner loop in GridSearchCV
Sebastian Raschka
se.raschka at gmail.com
Mon Feb 27 11:27:24 EST 2017
Hi, Ludovico,
What format (shape) is your data in? Are these arrays from a KFold iterator? In that case, the "question marks" in your code snippet should simply be the train and validation subset indices generated by the KFold generator, e.g.:
# pre-0.18 cross_validation API
skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)
for outer_train_idx, outer_valid_idx in skfold:
    …
    gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
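For reference, the same outer loop written against the 0.18+ `model_selection` API; the toy data and the small `C` grid below are stand-ins, so treat this as a minimal sketch rather than your actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy stand-ins for X_train / y_train (your data would go here)
X_train, y_train = make_classification(n_samples=100, random_state=1)

# Inner loop: GridSearchCV does its own 3-fold split on each outer training set
gridsearch_object = GridSearchCV(SVC(), param_grid={'C': [0.1, 1.0, 10.0]}, cv=3)

# 0.18+ API: the data is passed to split(), not to the constructor
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores = []
for outer_train_idx, outer_valid_idx in skfold.split(X_train, y_train):
    gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
    scores.append(gridsearch_object.score(X_train[outer_valid_idx],
                                          y_train[outer_valid_idx]))
```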
>
> On the other hand, when we try to pass the nested i-th cv fold as cv argument for clf, and we call fit on the same cv_nested fold, we get an "Index out of bound" error.
> Two questions:
Are you using a version older than scikit-learn 0.18? Technically, GridSearchCV, RandomizedSearchCV, cross_val_score, etc. should all support iterables of (train_indices, test_indices) tuples, e.g.:
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                   X=X_train,
                                   y=y_train,
                                   cv=outer_cv,
                                   n_jobs=1)
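A self-contained sketch of that nested setup (toy data and an illustrative pipeline and grid, not your exact configuration); the explicit list of index tuples at the end shows the "iterable of splits" form of the cv argument, which is how a custom outer schema can be passed in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, GridSearchCV,
                                     cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-ins for X_train / y_train
X_train, y_train = make_classification(n_samples=200, random_state=1)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', SVC(kernel='linear', random_state=42))])

inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: tune C on the inner folds of each outer training set
gs_est = GridSearchCV(pipe, param_grid={'clf__C': [0.1, 1.0, 10.0]},
                      scoring='roc_auc', cv=inner_cv)

# cv also accepts a plain list of (train_idx, test_idx) tuples, so a
# custom outer schema can be precomputed and passed directly:
custom_cv = list(outer_cv.split(X_train, y_train))
nested_scores = cross_val_score(gs_est, X=X_train, y=y_train,
                                cv=custom_cv, n_jobs=1)
```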
Best,
Sebastian
> On Feb 27, 2017, at 9:27 AM, Ludovico Coletta <ludo25_90 at hotmail.com> wrote:
>
> Dear Scikit experts,
>
> we are stuck with GridSearchCV. Nobody else was able or willing to help us; we hope you will.
>
> We are analysing neuroimaging data coming from 3 different MRI scanners, where for each scanner we have a healthy group and a disease group. We would like to merge the data from the 3 different scanners in order to classify the healthy subjects from the ones who have the disease.
>
> The problem is that we can almost perfectly classify the subjects according to the scanner (e.g. the healthy subjects from scanner 1 and scanner 2). We are using a custom cross-validation schema to account for the different scanners: when no hyperparameter (SVM) optimization is performed, everything is straightforward. Problems arise when we would like to perform hyperparameter optimization: in this case we need to balance for the different scanners in the optimization phase as well. We also found a custom cv schema for this, but we are not able to pass it to the GridSearchCV object. We would like to get something like the following:
>
> pipeline = Pipeline([('scl', StandardScaler()),
>                      ('sel', RFE(estimator, step=0.2)),
>                      ('clf', SVC(probability=True, random_state=42))])
>
>
> param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
>                'clf__C': np.logspace(-3, 5, 100),
>                'clf__kernel': ['linear']}]
>
> clf = GridSearchCV(pipeline,
>                    param_grid=param_grid,
>                    verbose=1,
>                    scoring='roc_auc',
>                    n_jobs=-1)
>
> # cv_final is the custom cv for the outer loop (9 folds)
>
> ii = 0
>
> while ii < len(cv_final):
>     # fit and predict
>     clf.fit(data[?], y[?])
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
>     ii = ii + 1
>
> We tried almost everything. When we define clf in the loop, pass the i-th cv_nested as cv argument, and fit it on the training data of the i-th custom_cv fold, we get a "Too many values to unpack" error. On the other hand, when we try to pass the nested i-th cv fold as cv argument for clf, and we call fit on the same cv_nested fold, we get an "Index out of bound" error.
> Two questions:
> 1) Is there any workaround to avoid the split when clf is called without a cv argument?
> 2) We suppose that for hyperparameter optimization the test data is removed from the dataset and a new dataset is created. Is this true? If so, we only need to adjust the indices accordingly.
>
> Thank you for your time, and sorry for the long text.
> Ludovico
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn