<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif;" dir="ltr">
<p>Dear Scikit experts,</p>
<p><br>
</p>
<p>We are stuck with GridSearchCV. Nobody else was able or willing to help us; we hope you will. </p>
<p><br>
</p>
<p>We are analysing neuroimaging data coming from 3 different MRI scanners, where for each scanner we have a healthy group and a disease group. We would like to merge the data from the 3 scanners in order to distinguish the healthy subjects from those who have the disease. </p>
<p><br>
</p>
<p>The problem is that we can almost perfectly classify the subjects according to the scanner (e.g. the healthy subjects from scanner 1 versus scanner 2). We are using a custom cross-validation scheme to account for the different scanners: when no hyper-parameter (SVM) optimization is performed, everything is straightforward. Problems arise when we want to perform hyper-parameter optimization: in that case we need to balance the different scanners in the optimization phase as well. We also built a custom cv scheme for this, but we are not able to pass it to the GridSearchCV object. We would like to get something like the following:</p>
<p><br>
</p>
<p></p>
<pre style="font-family:monospace;font-size:12pt;">
pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', RFE(estimator, step=0.2)),
                     ('clf', SVC(probability=True, random_state=42))])

param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
               'clf__C': np.logspace(-3, 5, 100),
               'clf__kernel': ['linear']}]

clf = GridSearchCV(pipeline,
                   param_grid=param_grid,
                   verbose=1,
                   scoring='roc_auc',
                   n_jobs=-1)

# cv_final is the custom cv for the outer loop (9 folds)

predictions = []
ii = 0
while ii &lt; len(cv_final):
    # fit and predict
    clf.fit(data[<b><span style="color: rgb(255, 0, 0);">?</span></b>], y[<b><span style="color: rgb(255, 0, 0);">?</span></b>])
    predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
    ii = ii + 1
</pre>
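<p>For reference, here is a minimal, self-contained toy version of what we mean by a precomputed cv. As far as we understand, GridSearchCV accepts any iterable of (train_indices, test_indices) pairs as its cv argument; the data and split values below are invented purely for illustration:</p>

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy data: alternating labels, so every block of samples has both classes
X = np.random.RandomState(0).randn(30, 4)
y = np.array([0, 1] * 15)

# a custom scheme: any iterable of (train_indices, test_indices) pairs
custom_cv = [(np.arange(0, 20), np.arange(20, 30)),
             (np.arange(10, 30), np.arange(0, 10))]

clf = GridSearchCV(SVC(kernel='linear'),
                   param_grid={'C': [0.1, 1.0]},
                   scoring='roc_auc',
                   cv=custom_cv)
clf.fit(X, y)            # only the two listed splits are used
print(clf.best_params_)
```

<p>In this toy case the grid search evaluates exactly the two splits we supplied, instead of building its own folds.</p>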
<p></p>
<p>We tried almost everything. When we define clf inside the loop, pass the <i>i</i>-th cv_nested as the cv argument, and fit it on the training data of the <i>i</i>-th custom_cv fold, we get a "too many values to unpack" error. On the other hand, when we try to pass the <i>i</i>-th nested cv fold as the cv argument for clf and call fit on the same cv_nested fold, we get an "index out of bounds" error. </p>
<p>Two questions:</p>
<p>1) Is there any workaround to avoid the default split when clf is fitted without a cv argument? </p>
<p>2) We suppose that for hyper-parameter optimization the test data is removed from the dataset and a new dataset is created. Is this true? If so, we only have to adjust the indices accordingly.</p>
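<p>To make question 2 concrete, this is the index remapping we think is needed (the variable names below are ours, not scikit-learn's): if the inner splits are stored as absolute indices into the full dataset, they would have to be translated into positions within the outer training subset before being used as cv on that subset:</p>

```python
import numpy as np

# outer training fold, as absolute indices into the full dataset
outer_train = np.array([0, 2, 5, 7, 8, 11])

# one inner (train, test) split, also in absolute indices
inner_abs = (np.array([0, 5, 8]), np.array([2, 7, 11]))

# map each absolute index to its position inside the outer training subset
pos = {v: i for i, v in enumerate(outer_train)}
inner_local = tuple(np.array([pos[j] for j in part]) for part in inner_abs)

# inner_local now holds positions 0..len(outer_train)-1,
# usable as a cv split on data[outer_train]
print(inner_local)
```

<p>Is this the kind of adjustment GridSearchCV expects, or does it handle the translation internally?</p>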
<p><br>
</p>
<p>Thank you for your time, and sorry for the long text.</p>
<p>Ludovico</p>
</div>
</body>
</html>