[scikit-learn] Trying to get learning curves with custom scorer and leave one group out
Matteo Niccoli
matteo at mycarta.ca
Fri Dec 2 22:28:38 EST 2016
Hi all,
I want to plot learning curves for a trained SVM classifier, using a
custom scorer and Leave One Group Out as the cross-validation method. I
thought I had it figured out, but two different scorers - 'f1_micro' and
'accuracy' - yield identical values. I am confused: is that supposed to
be the case?
Here's my code (unfortunately I cannot share the data as it is not open):
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve

SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma=0.01,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)

training_data = pd.read_csv('training_data.csv')
# feature matrix (in my data, every column except Targets and Groups)
X = training_data.drop(['Targets', 'Groups'], axis=1).values
y = training_data['Targets'].values
groups = training_data['Groups'].values

# standardize the features
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)

# micro-averaged F1 as a custom scorer
Fscorer = make_scorer(f1_score, average='micro')

logo = LeaveOneGroupOut()
parm_range0 = np.logspace(-2, 6, 9)
train_scores0, test_scores0 = validation_curve(
    SVC_classifier_LOWO_VC0, X, y, 'C', parm_range0,
    cv=logo.split(X, y, groups=groups), scoring=Fscorer)
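As a sanity check on the cross-validation itself (logo.split() returns a
generator, so I create a fresh split for each validation_curve call), the
number of folds comes back as one per distinct group:

print(logo.get_n_splits(X, y, groups=groups))  # number of folds
print(len(np.unique(groups)))                  # distinct groups; same number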
Now, from:
train_scores_mean0 = np.mean(train_scores0, axis=1)
train_scores_std0 = np.std(train_scores0, axis=1)
test_scores_mean0 = np.mean(test_scores0, axis=1)
test_scores_std0 = np.std(test_scores0, axis=1)
print(test_scores_mean0)
print(np.amax(test_scores_mean0))
print(parm_range0[test_scores_mean0.argmax(axis=0)])  # C at the peak score
I get:
[ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438
0.49426622 0.48066419 0.4868987 ]
0.502174200206
100.0
If I create a new classifier with the same parameters and run everything
exactly as before, except for the scoring, e.g.:
# SVC_classifier_LOWO_VC1 is a second svm.SVC with the same parameters
parm_range1 = np.logspace(-2, 6, 9)
train_scores1, test_scores1 = validation_curve(
    SVC_classifier_LOWO_VC1, X, y, 'C', parm_range1,
    cv=logo.split(X, y, groups=groups), scoring='accuracy')
train_scores_mean1 = np.mean(train_scores1, axis=1)
train_scores_std1 = np.std(train_scores1, axis=1)
test_scores_mean1 = np.mean(test_scores1, axis=1)
test_scores_std1 = np.std(test_scores1, axis=1)
print(test_scores_mean1)
print(np.amax(test_scores_mean1))
print(parm_range1[test_scores_mean1.argmax(axis=0)])
I get exactly the same answer:
[ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438
0.49426622 0.48066419 0.4868987 ]
0.502174200206
100.0
How is that possible? Am I doing something wrong, or missing something?
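For what it's worth, a quick toy check with made-up labels (nothing to do
with my data) also gives matching numbers for the two metrics:

from sklearn.metrics import accuracy_score, f1_score
y_true = [0, 1, 2, 2, 1, 0]  # toy multiclass, single-label targets
y_pred = [0, 2, 2, 2, 1, 1]
print(accuracy_score(y_true, y_pred))             # 0.666...
print(f1_score(y_true, y_pred, average='micro'))  # 0.666...

So maybe micro-averaged F1 is simply expected to coincide with accuracy in
the single-label multiclass case?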
Thanks