[scikit-learn] creating a custom scoring function for cross-validation of classification

Sumeet Sandhu sumeet.k.sandhu at gmail.com
Mon Oct 31 16:28:43 EDT 2016


I've been staring at various doc pages for a while to create a custom
scorer that uses predict_proba output of a multi-class SGDClassifier :

I got the impression I could customize the "scoring" parameter in
cross_val_score directly, but that didn't work.
Then I tried customizing the "score_func" parameter in make_scorer, but
that didn't work either. Both errors are ValueErrors:

Traceback (most recent call last):
  File "<pyshell#96>", line 3, in <module>
    accuracy = mean(cross_val_score(LRclassifier, trainPatentVecs,
trainLabelVecs, cv=10, scoring = 'topNscorer'))
line 1425, in cross_val_score
    scorer = check_scoring(estimator, scoring=scoring)
line 238, in check_scoring
    return get_scorer(scoring)
line 197, in get_scorer
    % (scoring, sorted(SCORERS.keys())))
ValueError: 'topNscorer' is not a valid scoring value. Valid options are
['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro',
'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error',
'mean_squared_error', 'median_absolute_error', 'precision',
'precision_macro', 'precision_micro', 'precision_samples',
'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro',
'recall_samples', 'recall_weighted', 'roc_auc']

At a high level, I want to find out if the true label was found in the top
N multi-class labels coming out of an SGD classifier. Built-in scores like
"accuracy" only look at N=1.
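
To make the metric concrete, here is a minimal pure-Python sketch of top-N accuracy on made-up data. The function name top_n_accuracy and the toy probabilities are illustrative assumptions, and the sketch assumes the true labels are integer column indices into the probability rows.

```python
def top_n_accuracy(y_true, proba, n=5):
    """Fraction of samples whose true label index is among the n highest-scoring columns."""
    hits = 0
    for label, row in zip(y_true, proba):
        # Column indices sorted by descending probability; keep the first n.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:n]
        hits += int(label in top)
    return hits / float(len(y_true))


# Toy data (made up for illustration): 3 samples, 4 classes.
proba = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.15, 0.30, 0.50],
]
y_true = [2, 0, 0]

print(top_n_accuracy(y_true, proba, n=1))  # only the second sample is a hit
print(top_n_accuracy(y_true, proba, n=2))  # the first sample becomes a hit as well
```

With N=1 this reduces to ordinary accuracy; raising N can only keep the score the same or increase it.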

Here is the code using make_scorer :
        LRclassifier = SGDClassifier(loss='log')
        topNscorer = make_scorer(topNscoring, greater_is_better=True,
        accuracyN = mean(cross_val_score(LRclassifier, Data, Labels, scoring = 'topNscorer'))

Here is the code for the custom scoring function :
def topNscoring(y, yp):
    ## Inputs y = true label per sample, yp = predict_proba probabilities of all labels per sample
    N = 5
    foundN = []
    for ii in xrange(0,shape(yp)[0]):
        indN = [ w[0] for w in sorted(enumerate(list(yp[ii,:])),key=lambda w:w[1],reverse=True)[0:N] ]
        if y[ii] in indN:
            foundN.append(1)
        else:
            foundN.append(0)
    return mean(foundN)
    return mean(foundN)
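
One possible direction, sketched under assumptions rather than offered as a confirmed fix: the traceback shows that a string passed as "scoring" is looked up among the predefined scorer names, while the "scoring" parameter is also documented to accept a callable with the signature scorer(estimator, X, y). The names top_n_scorer and FixedProbaModel below are hypothetical. The stand-in model exists only so the sketch runs without scikit-learn, and the cross_val_score call is left as a comment because it was not run here. The sketch maps probability columns to labels through classes_, so it does not require labels to be 0..K-1.

```python
def top_n_scorer(estimator, X, y, n=5):
    """Scorer with the (estimator, X, y) signature; returns top-n accuracy."""
    proba = estimator.predict_proba(X)
    # Columns of predict_proba follow the order of estimator.classes_.
    classes = list(estimator.classes_)
    hits = 0
    for label, row in zip(y, proba):
        order = sorted(range(len(classes)), key=lambda j: row[j], reverse=True)[:n]
        hits += int(label in [classes[j] for j in order])
    return hits / float(len(y))


class FixedProbaModel(object):
    """Hypothetical stand-in for a fitted classifier, used only to exercise the scorer."""
    classes_ = ["a", "b", "c"]

    def predict_proba(self, X):
        # Same made-up probabilities for every sample.
        return [[0.2, 0.5, 0.3] for _ in X]


demo_score = top_n_scorer(FixedProbaModel(), [[0], [1], [2]], ["a", "b", "c"], n=2)
print(demo_score)  # "b" and "c" are in the top 2, "a" is not

# With scikit-learn installed, the callable itself (not a string) would be passed:
#   cross_val_score(LRclassifier, Data, Labels, cv=10, scoring=top_n_scorer)
```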

Any help will be greatly appreciated.

best regards,
