[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
Roman Yurchak
rth.yurchak at gmail.com
Tue Dec 27 04:51:39 EST 2016
Hi Debu,
On 27/12/16 08:18, Andrew Howe wrote:
> 5. I got a prediction result with True Positive Rate (TPR) as 10-12
> % on probability thresholds above 0.5
A high True Positive Rate (recall) is not by itself a sufficient
condition for a well-behaved model: you should look at the precision
at the same time (or at a combined metric such as the F1 score). That
said, a recall of 0.1 is quite low on its own.
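For instance, with sklearn.metrics (the toy arrays below just stand in
for your test labels and your thresholded predictions):

    from sklearn.metrics import precision_score, recall_score, f1_score

    # toy example -- replace with your own test labels and the
    # predictions at your chosen 0.5 threshold
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 0, 1, 0]
    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("f1:", f1_score(y_true, y_pred))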
> 7. I reloaded the model in a different python instance from the
> pickle file mentioned above and did my scoring, i.e., used
> joblib library load method and then instantiated prediction
> (predict_proba method) on the entire set of my original 600 K
> records
> Another question – is there an alternate model scoring
> library (apart from joblib, the one I am using)?
Joblib is not a scoring library; it is a generic persistence utility
used to save estimators to disk. Once you load a model with joblib you
get back essentially the same RandomForestClassifier estimator object
as the one you saved, with the same predict / predict_proba methods, so
there is nothing to replace it with for "scoring".
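For instance, a minimal round-trip sketch (with a toy dataset in place
of yours):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.externals import joblib  # scikit-learn's bundled joblib

    X, y = make_classification(n_samples=1000, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    joblib.dump(clf, 'model.pkl')

    clf2 = joblib.load('model.pkl')
    # the reloaded estimator produces the same probabilities
    assert np.allclose(clf.predict_proba(X), clf2.predict_proba(X))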
> 8. Now when I am running (scoring) my model using
> joblib.predict_proba on the entire set of original data (600 K),
> I am getting a True Positive rate of around 80%.
That sounds expected, given what you are doing. Your full 600 K set
consists of the 80 % used for training (on which a random forest's
recall will typically be close to 1.0, since it largely memorises the
training data) and the 20 % test split (recall ~ 0.1), so the
size-weighted average over the complete set is about
0.8 * 1.0 + 0.2 * 0.1 ~ 0.82. Unless I missed something.
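You can check this by scoring the two splits separately; a rough
sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # train recall is typically near 1.0 (the forest largely
    # memorises its training data); test recall is lower
    print("train recall:", recall_score(y_train, clf.predict(X_train)))
    print("test recall:", recall_score(y_test, clf.predict(X_test)))
    # scoring the full set mixes the two: roughly 0.8*train + 0.2*test
    print("full recall:", recall_score(y, clf.predict(X)))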
> 9. I did some further analysis and figured out that during the
> training process, when the model was predicting on the test
> sample of 120K it could only predict 10-12% of 120K data beyond
> a probability threshold of 0.5. When I am now trying to score my
> model on the entire set of 600 K records, it appears that the
> model is remembering some of it’s past behavior and data and
> accordingly throwing 80% True positive rate
It feels like your RandomForestClassifier is not properly tuned: a
recall of 0.1 on the test set is quite low. The model is not
"remembering" anything at scoring time; it is simply being evaluated
on the very data it was trained on, which inflates the metric (see
point 8 above). It could be worth tuning the hyperparameters better
(cf. https://stackoverflow.com/a/36109706), using some metric other
than recall alone to evaluate the performance.
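For instance, a quick GridSearchCV sketch optimising F1 instead of
recall (the parameter grid below is just a guess, not a recommendation
for your data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # toy imbalanced data standing in for yours
    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1],
                               random_state=0)
    param_grid = {'n_estimators': [100, 300],
                  'max_depth': [None, 10, 20],
                  'min_samples_leaf': [1, 5, 10],
                  'class_weight': [None, 'balanced']}
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, scoring='f1', cv=3)
    search.fit(X, y)
    print(search.best_params_)
    print(search.best_score_)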
Roman