Dear Scikit experts,

we are stuck with GridSearchCV. Nobody else was able or willing to help us; we hope you will.

We are analysing neuroimaging data coming from 3 different MRI scanners, where for each scanner we have a healthy group and a disease group. We would like to merge the data from the 3 scanners in order to distinguish the healthy subjects from those who have the disease.

The problem is that we can almost perfectly classify the subjects according to the scanner (e.g. the healthy subjects from scanner 1 versus those from scanner 2). We are using a custom cross-validation scheme to account for the different scanners: when no hyperparameter (SVM) optimization is performed, everything is straightforward. Problems arise when we want to perform hyperparameter optimization: in this case we need to balance for the different scanners in the optimization phase as well. We also found a custom cv scheme for this, but we are not able to pass it to the GridSearchCV object. We would like to get something like the following:

    pipeline = Pipeline([('scl', StandardScaler()),
                         ('sel', RFE(estimator, step=0.2)),
                         ('clf', SVC(probability=True, random_state=42))])
    param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
                   'clf__C': np.logspace(-3, 5, 100),
                   'clf__kernel': ['linear']}]
    clf = GridSearchCV(pipeline, param_grid=param_grid, verbose=1,
                       scoring='roc_auc', n_jobs=-1)

    # cv_final is the custom cv for the outer loop (9 folds)
    ii = 0
    while ii < len(cv_final):
        # fit and predict
        clf.fit(data[?], y[?])
        predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
        ii = ii + 1

We tried almost everything. When we define clf inside the loop, pass the i-th cv_nested as the cv argument, and fit it on the training data of the i-th custom_cv fold, we get a "Too many values to unpack" error. On the other hand, when we pass the i-th nested cv fold as the cv argument for clf and call fit on the same cv_nested fold, we get an "Index out of bound" error.

Two questions:

1) Is there any workaround to avoid the split when clf is called without a cv argument?
2) We suppose that for hyperparameter optimization the test data is removed from the dataset and a new dataset is created. Is this true? In that case we only have to adjust the indices accordingly.

Thank you for your time, and sorry for the long text.

Ludovico
Hi, Ludovico,

what format (shape) is data in? Are these the arrays from a KFold iterator? In this case, the “question marks” in your code snippet should simply be the train and validation subset indices generated by the KFold generator. E.g.,

    # pre-0.18 API (sklearn.cross_validation): the object itself is iterable
    skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)
    for outer_train_idx, outer_valid_idx in skfold:
        ...
        gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
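The same outer loop can be sketched with the `sklearn.model_selection` API (0.18 and later). This is only a minimal illustration under stated assumptions: the data are synthetic, the grid is tiny, the RFE step is left out, and `cv_final` is built here from `StratifiedKFold` rather than from the custom scanner-aware scheme described in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption, not the neuroimaging set).
X, y = make_classification(n_samples=90, n_features=10, random_state=0)

pipeline = Pipeline([("scl", StandardScaler()),
                     ("clf", SVC(kernel="linear"))])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# Outer loop stored as a list of (train_idx, test_idx) pairs, like cv_final.
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
cv_final = list(outer_cv.split(X, y))

predictions = np.empty_like(y)
for outer_train_idx, outer_test_idx in cv_final:
    # The inner (default-style) cv only ever sees the outer training rows.
    clf = GridSearchCV(pipeline, param_grid=param_grid,
                       scoring="roc_auc", cv=3)
    clf.fit(X[outer_train_idx], y[outer_train_idx])
    predictions[outer_test_idx] = clf.predict(X[outer_test_idx])

outer_accuracy = float(np.mean(predictions == y))
```

Because each outer test fold is predicted by a search fitted only on the matching training rows, `predictions` ends up with one out-of-sample prediction per subject.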
On the other hand, when we try to pass the i-th nested cv fold as the cv argument for clf, and we call fit on the same cv_nested fold, we get an "Index out of bound" error.
Are you using a version older than scikit-learn 0.18? Technically, GridSearchCV, RandomizedSearchCV, cross_val_score, ... should all support iterables of train and test indices, e.g.:

    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    for name, gs_est in sorted(gridcvs.items()):
        nested_score = cross_val_score(gs_est, X=X_train, y=y_train, cv=outer_cv, n_jobs=1)

Best,
Sebastian
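Regarding the "Index out of bound" error: when `cv` is given as a list of (train, test) index pairs, GridSearchCV applies those indices to the arrays passed to `fit`, so inner folds have to be numbered relative to the outer training subset, not the full data set. The sketch below illustrates this under assumptions: the data are synthetic, the `scanner` array is hypothetical, and stratifying on a combined scanner-by-diagnosis label is just one possible way to balance scanners, not necessarily the scheme used in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; `scanner` is a hypothetical per-subject label.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
scanner = np.tile([0, 1, 2], 40)

# Combined label so folds keep the scanner-by-diagnosis mix (assumption).
strata = scanner * 2 + y

pipeline = Pipeline([("scl", StandardScaler()),
                     ("clf", SVC(kernel="linear"))])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)

predictions = np.empty_like(y)
subset_sizes, max_inner_index = [], []
for outer_train_idx, outer_test_idx in outer_cv.split(X, strata):
    X_tr, y_tr = X[outer_train_idx], y[outer_train_idx]

    # Build the inner folds on the training subset itself, so the indices
    # run from 0 to len(X_tr) - 1 rather than over the full data set.
    inner_splits = list(inner_cv.split(X_tr, strata[outer_train_idx]))
    subset_sizes.append(len(X_tr))
    max_inner_index.append(max(int(te.max()) for _, te in inner_splits))

    # cv must be a list of (train, test) pairs, not a single pair.
    clf = GridSearchCV(pipeline, param_grid=param_grid,
                       scoring="roc_auc", cv=inner_splits)
    clf.fit(X_tr, y_tr)
    predictions[outer_test_idx] = clf.predict(X[outer_test_idx])
```

If the inner folds already exist as indices into the full data set, they would need to be remapped to positions within `outer_train_idx` before being passed as `cv`. Passing a single (train, test) pair instead of a list of pairs may also be what triggers the "Too many values to unpack" error, though that is a guess without seeing the full traceback.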
On Feb 27, 2017, at 9:27 AM, Ludovico Coletta <ludo25_90@hotmail.com> wrote:
[...]