[scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface?

Anton a.suchaneck at gmail.com
Sat Nov 12 04:17:29 EST 2016


Hi Andy!

Thank you for your feedback!

You say I shouldn't use __init__(**params), and that makes total sense 
and would make my code much simpler. However,
sklearn 0.18, base.clone, line 70: new_object = 
klass(**new_object_params)
(called from RandomizedSearchCV) trips you up, since it passes the 
parameters to __init__(). I expected set_params() to be used here, but 
my grid search parameters end up being passed to __init__().
Is this intended?
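
As far as I can tell, clone() always rebuilds the estimator through 
__init__ with whatever get_params(deep=False) returns; here is a 
simplified sketch of my reading of sklearn/base.py (ignoring the 
deep-copying of parameters and the sanity checks):

def clone_sketch(estimator):
    klass = estimator.__class__
    new_object_params = estimator.get_params(deep=False)
    # everything returned by get_params() is fed back into __init__;
    # set_params() is never called here
    return klass(**new_object_params)

So any keyword that only lives in **params and is not echoed back by 
get_params() would be lost on cloning.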

Note that since I'm just wrapping a clf, I have to pass the parameters 
through to self.clf, right? No one else can know that I'm storing it 
in self.clf.
Therefore set_params needs to be implemented and cannot simply be 
inherited?!
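
Or is the intended pattern to drop the pass-through entirely, keep clf 
as a plain constructor parameter, and address the inner estimator with 
the double-underscore syntax in the search grid? Something like this 
sketch (fit/predict as before):

class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, clf, find_threshold):
        # store constructor arguments unmodified; BaseEstimator then
        # provides get_params()/set_params(), including nested
        # clf__* parameters for the wrapped estimator
        self.clf = clf
        self.find_threshold = find_threshold

# and in the RandomizedSearchCV parameter grid:
# {"clf__n_estimators": [100, 200], "clf__min_samples_leaf": [1, 5]}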

My meta-classifier finds the optimal threshold upon .fit(). This 
procedure depends on how "optimal" is interpreted, and that is what 
find_threshold_cost_function is for.

One last question: is self.classes_ a necessary part of the API (I 
realize I forgot the underscore), and am I missing any other API detail 
I need to add for a binary classifier?
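
For reference, this is the minimal skeleton I am currently assuming a 
binary classifier has to provide (some_param is just a placeholder; 
please correct me if something is missing):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MinimalBinaryClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, some_param=1.0):
        self.some_param = some_param      # constructor args stored as-is

    def fit(self, X, y):
        self.classes_ = np.unique(y)      # learned attributes end with "_"
        return self                       # fit() returns self

    def predict(self, X):
        # dummy rule, only to show the expected output: one label per row
        return np.full(len(X), self.classes_[0], dtype=self.classes_.dtype)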

Regards,
Anton


On Fri, 11 Nov 2016 at 7:09, Andreas Mueller <t3kcit at gmail.com> wrote:
> Hi.
> You don't have to implement set_params and get_params if you inherit 
> from BaseEstimator.
> I find it weird that you pass find_threshold_cost_function as a 
> constructor parameter, but otherwise the API looks OK.
> You are not allowed to use **kwargs in __init__, though.
> 
> Andy
> 
> On 11/11/2016 05:23 AM, Anton Suchaneck wrote:
>> Hi!
>> 
>> I tried writing a ThresholdClassifier that wraps any classifier 
>> with predict_proba() and, based on a cost function, adjusts the 
>> threshold for predict(). This helps with imbalanced data.
>> My current cost function assigns +cost_factor for a true positive 
>> and -1 for a false positive.
>> It seems to run, but I'm not sure whether I got the API for a 
>> classifier right.
>> 
>> Can you tell me whether this is how the functions should be 
>> implemented to play together with the other parts of sklearn?
>> 
>> Especially the parameter handling for base.clone, both in 
>> klass.__init__ and in .set_params(), seemed weird.
>> 
>> Here is the code. The class ThresholdClassifier wraps a clf, a 
>> RandomForest in this case.
>> 
>> Anton
>> 
>> import numpy as np
>> from sklearn.base import BaseEstimator, ClassifierMixin
>> from functools import partial
>> 
>> def find_threshold_cost_factor(clf, X, y, cost_factor):
>>     # Scan predictions from highest to lowest probability, accumulating
>>     # +cost_factor per true positive and -1 per false positive, and keep
>>     # the probability at which the cumulative score peaks as the threshold.
>>     y_pred = clf.predict_proba(X)
>> 
>>     top_score = 0
>>     top_threshold = None
>>     cur_score = 0
>>     for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), reverse=True):  # FIXME: assumes 2 classes
>>         if y_el == 0:
>>             cur_score -= 1
>>         if y_el == 1:
>>             cur_score += cost_factor
>>         if cur_score > top_score:
>>             top_score = cur_score
>>             top_threshold = y_pred_el
>>     return top_threshold, top_score
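>>
>> # Worked example (hypothetical numbers): with cost_factor=10 and the
>> # (probability, label) pairs (0.9, 1), (0.8, 0), (0.7, 1), (0.6, 0),
>> # the cumulative scores along the scan are 10, 9, 19, 18, so the
>> # function returns threshold 0.7 with score 19.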
>> 
>> 
>> class ThresholdClassifier(BaseEstimator, ClassifierMixin):
>>     def __init__(self, clf, find_threshold, **params):
>>         self.clf = clf
>>         self.find_threshold = find_threshold
>>         self.threshold = None
>>         self.set_params(**params)
>> 
>>     def score(self, X, y, sample_weight=None):
>>         _threshold, score = self.find_threshold(self.clf, X, y)
>>         return score
>> 
>>     def fit(self, X, y):
>>         self.clf.fit(X, y)
>>         self.threshold, _score = self.find_threshold(self.clf, X, y)
>>         self.classes_ = self.clf.classes_
>>         return self  # fit() should return self for sklearn compatibility
>> 
>>     def predict(self, X):
>>         y_score = self.clf.predict_proba(X)
>>         return np.array(y_score[:, 1] >= self.threshold)  # FIXME: assumes 2 classes
>> 
>>     def predict_proba(self, X):
>>         return self.clf.predict_proba(X)
>> 
>>     def set_params(self, **params):
>>         for param_name in ["clf", "find_threshold", "threshold"]:
>>             if param_name in params:
>>                 setattr(self, param_name, params[param_name])
>>                 del params[param_name]
>>         self.clf.set_params(**params)
>>         return self
>> 
>>     def get_params(self, deep=True):
>>         params = {"clf": self.clf,
>>                   "find_threshold": self.find_threshold,
>>                   "threshold": self.threshold}
>>         params.update(self.clf.get_params(deep))
>>         return params
>> 
>> 
>> if __name__ == '__main__':
>>     import numpy as np
>>     import random
>>     from sklearn.grid_search import RandomizedSearchCV
>>     from sklearn.ensemble import RandomForestClassifier
>>     from sklearn.datasets import make_classification
>>     from sklearn.cross_validation import train_test_split
>>     from sklearn.metrics import make_scorer, classification_report, confusion_matrix
>> 
>>     np.random.seed(111)
>>     random.seed(111)
>> 
>>     X, y = make_classification(1000,
>>                                n_features=20,
>>                                n_informative=4,
>>                                n_redundant=0,
>>                                n_repeated=0,
>>                                n_clusters_per_class=4,
>>                                # class_sep=0.5,
>>                                weights=[0.90]
>>                                )
>> 
>>     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
>> 
>>     for cost in [10]:
>>         find_threshold = partial(find_threshold_cost_factor, cost_factor=cost)
>> 
>>         def scorer(clf, X, y):
>>             return find_threshold(clf, X, y)[1]
>> 
>>         clfs = [RandomizedSearchCV(
>>                     ThresholdClassifier(RandomForestClassifier(), find_threshold),
>>                     {"n_estimators": [100, 200],
>>                      "criterion": ["entropy"],
>>                      "min_samples_leaf": [1, 5],
>>                      "class_weight": ["balanced", None],
>>                      },
>>                     cv=3,
>>                     scoring=scorer,  # get rid of this by letting the classifier report its cost-based score?
>>                     n_iter=8,
>>                     n_jobs=4),
>>                 ]
>> 
>>         for clf in clfs:
>>             clf.fit(X_train, y_train)
>>             clf_best = clf.best_estimator_
>>             print(clf_best, cost, clf_best.score(X_test, y_test))
>>             print(confusion_matrix(y_test, clf_best.predict(X_test)))
>>             #print(find_threshold(clf_best, X_train, y_train))
>>             #print(clf_best.threshold, sorted(zip(clf_best.predict_proba(X_train)[:,1], y_train), reverse=True)[:20])
>> 
>> 
>> 
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 