[scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface?

Andreas Mueller t3kcit at gmail.com
Fri Nov 11 13:09:32 EST 2016


Hi.
You don't have to implement set_params and get_params if you inherit 
from BaseEstimator.
I find it a bit odd that you pass the find_threshold function as a 
constructor parameter, but otherwise the API looks OK.
You are not allowed to use **kwargs in __init__, though.
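
For illustration, here is a rough sketch of what that convention looks 
like with the names from your code (not a complete implementation, just 
the constructor/fit part):

from sklearn.base import BaseEstimator, ClassifierMixin

class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    # __init__ only stores its arguments under the same names:
    # no **kwargs, no validation, no derived state.
    def __init__(self, clf=None, find_threshold=None):
        self.clf = clf
        self.find_threshold = find_threshold

    def fit(self, X, y):
        self.clf.fit(X, y)
        # fitted state conventionally goes into attributes with a
        # trailing underscore, set here rather than in __init__
        self.threshold_, _ = self.find_threshold(self.clf, X, y)
        self.classes_ = self.clf.classes_
        return self

# get_params/set_params are inherited from BaseEstimator (they introspect
# the __init__ signature), so base.clone just works, and the wrapped
# estimator's parameters are reachable as nested names in a search grid,
# e.g. "clf__n_estimators" instead of "n_estimators".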

Andy

On 11/11/2016 05:23 AM, Anton Suchaneck wrote:
> Hi!
>
> I tried writing a ThresholdClassifier that wraps any classifier with 
> predict_proba() and, based on a cost function, adjusts the threshold for 
> predict(). This helps with imbalanced data.
> My current cost function assigns +cost_factor for a true positive and -1 
> for a false positive.
> It seems to run, but I'm not sure if I got the API for a classifier right.
>
> Can you tell me whether this is how the functions should be 
> implemented to play together with other parts of sklearn?
>
> In particular, handling the parameters for base.clone, both in __init__ 
> and in set_params(), seemed awkward.
>
> Here is the code. The class ThresholdClassifier wraps a classifier, a 
> RandomForest in this case.
>
> Anton
>
> from sklearn.base import BaseEstimator, ClassifierMixin
> from functools import partial
> import numpy as np
>
> def find_threshold_cost_factor(clf, X, y, cost_factor):
>     y_pred = clf.predict_proba(X)
>
>     top_score = 0
>     top_threshold = None
>     cur_score = 0
>     # FIXME: assumes 2 classes
>     for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), reverse=True):
>         if y_el == 0:
>             cur_score -= 1
>         if y_el == 1:
>             cur_score += cost_factor
>         if cur_score > top_score:
>             top_score = cur_score
>             top_threshold = y_pred_el
>     return top_threshold, top_score
>
>
> class ThresholdClassifier(BaseEstimator, ClassifierMixin):
>     def __init__(self, clf, find_threshold, **params):
>         self.clf = clf
>         self.find_threshold = find_threshold
>         self.threshold = None
>         self.set_params(**params)
>
>     def score(self, X, y, sample_weight=None):
>         _threshold, score = self.find_threshold(self.clf, X, y)
>         return score
>
>     def fit(self, X, y):
>         self.clf.fit(X, y)
>         self.threshold, _score = self.find_threshold(self.clf, X, y)
>         self.classes_ = self.clf.classes_
>         return self
>
>     def predict(self, X):
>         y_score = self.clf.predict_proba(X)
>         # FIXME: assumes 2 classes
>         return np.array(y_score[:, 1] >= self.threshold)
>
>     def predict_proba(self, X):
>         return self.clf.predict_proba(X)
>
>     def set_params(self, **params):
>         for param_name in ["clf", "find_threshold", "threshold"]:
>             if param_name in params:
>                 setattr(self, param_name, params[param_name])
>                 del params[param_name]
>         self.clf.set_params(**params)
>         return self
>
>     def get_params(self, deep=True):
>         params = {"clf": self.clf, "find_threshold": self.find_threshold,
>                   "threshold": self.threshold}
>         params.update(self.clf.get_params(deep=deep))
>         return params
>
>
> if __name__ == '__main__':
>     import numpy as np
>     import random
>     from sklearn.grid_search import RandomizedSearchCV
>     from sklearn.ensemble import RandomForestClassifier
>     from sklearn.datasets import make_classification
>     from sklearn.cross_validation import train_test_split
>     from sklearn.metrics import make_scorer, classification_report, confusion_matrix
>
>     np.random.seed(111)
>     random.seed(111)
>
>     X, y = make_classification(1000,
>                                n_features=20,
>                                n_informative=4,
>                                n_redundant=0,
>                                n_repeated=0,
>                                n_clusters_per_class=4,
>                                # class_sep=0.5,
>                                weights=[0.90]
>                                )
>
>     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
>
>     for cost in [10]:
>         find_threshold = partial(find_threshold_cost_factor, cost_factor=cost)
>
>         def scorer(clf, X, y):
>             return find_threshold(clf, X, y)[1]
>
>         clfs = [RandomizedSearchCV(
>                     ThresholdClassifier(RandomForestClassifier(), find_threshold),
>                     {"n_estimators": [100, 200],
>                      "criterion": ["entropy"],
>                      "min_samples_leaf": [1, 5],
>                      "class_weight": ["balanced", None],
>                      },
>                     cv=3,
>                     scoring=scorer,  # Get rid of this by letting the classifier report its cost-based score?
>                     n_iter=8,
>                     n_jobs=4),
>                 ]
>
>         for clf in clfs:
>             clf.fit(X_train, y_train)
>             clf_best = clf.best_estimator_
>             print(clf_best, cost, clf_best.score(X_test, y_test))
>             print(confusion_matrix(y_test, clf_best.predict(X_test)))
>             #print(find_threshold(clf_best, X_train, y_train))
>             #print(clf_best.threshold, sorted(zip(clf_best.predict_proba(X_train)[:, 1], y_train), reverse=True)[:20])
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


