[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
Joel Nothman
joel.nothman at gmail.com
Tue Dec 27 05:52:30 EST 2016
Your model is overfit to the training data. That is not to say it is
necessarily possible to get a better fit. The default settings for trees
lean towards a tight fit, so you might modify their parameters to increase
regularisation. Still, you should not expect that evaluating a model's
performance on its training data will be indicative of its general
performance. This is why we use held-out test sets and cross-validation.
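
For example, a minimal sketch of that idea (the parameter values and the
synthetic data here are only placeholders, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Toy imbalanced data standing in for the real 600 K-row set.
    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1],
                               random_state=0)

    # Shallower trees and larger leaves regularise the forest; the
    # defaults grow each tree until its leaves are (nearly) pure.
    clf = RandomForestClassifier(n_estimators=200, max_depth=8,
                                 min_samples_leaf=20, random_state=0)

    # Evaluate on held-out folds, not on the data the model was fit to.
    print(cross_val_score(clf, X, y, cv=5, scoring='f1').mean())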
On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com> wrote:
> Hi Debu,
>
> On 27/12/16 08:18, Andrew Howe wrote:
> > 5. I got a prediction result with True Positive Rate (TPR) as 10-12
> > % on probability thresholds above 0.5
>
> Getting a high True Positive Rate (recall) is not a sufficient condition
> for a well-behaved model. Though 0.1 recall is still pretty bad. You
> could look at the precision at the same time (or consider, for instance,
> the F1 score).
>
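
A quick way to look at all three at once, assuming y_test holds the true
test labels and proba is the predict_proba output on the test set (both
are placeholders here):

    from sklearn.metrics import classification_report

    # Threshold the positive-class probability at 0.5, as above.
    y_pred = (proba[:, 1] >= 0.5).astype(int)
    print(classification_report(y_test, y_pred, digits=3))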
> > 7. I reloaded the model in a different python instance from the
> > pickle file mentioned above and did my scoring , i.e., used
> > joblib library load method and then instantiated prediction
> > (predict_proba method) on the entire set of my original 600 K
> > records
> > Another question – is there an alternate model scoring
> > library (apart from joblib, the one I am using) ?
>
> Joblib is not a scoring library; once you load a model from disk with
> joblib you should get essentially the same RandomForestClassifier
> estimator object as before saving it.
>
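
That round trip looks roughly like this (the file name is arbitrary;
clf stands for the fitted classifier and X_test for the held-out
features):

    from sklearn.externals import joblib  # or simply `import joblib`

    # Persist the fitted estimator and load it back; what comes back is
    # the same RandomForestClassifier, fitted trees and all.
    joblib.dump(clf, 'rf_model.pkl')
    clf_loaded = joblib.load('rf_model.pkl')

    # predict_proba does the scoring; joblib only handles persistence.
    proba = clf_loaded.predict_proba(X_test)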
> > 8. Now when I am running (scoring) my model using
> > joblib.predict_proba on the entire set of original data (600 K),
> > I am getting a True Positive rate of around 80%.
>
> That sounds normal, considering what you are doing. Your entire set
> consists of the training set (80% of the data, for which the recall, I
> imagine, would be close to 1.0) and the test set (20%, with a recall of
> 0.1), so on average you would get a recall close to 0.8 for the complete
> set. Unless I missed something.
>
>
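
Spelled out with the numbers from the thread (600 K rows, 120 K of them
held out as the test sample) and the recalls Roman assumes above:

    (480,000 * 1.0 + 120,000 * 0.1) / 600,000 = 0.82

so a recall of about 0.8 on the full data is just what mixing the two
sets would produce, not the model "remembering" anything.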
> > 9. I did some further analysis and figured out that during the
> > training process, when the model was predicting on the test
> > sample of 120K it could only predict 10-12% of 120K data beyond
> > a probability threshold of 0.5. When I am now trying to score my
> > model on the entire set of 600 K records, it appears that the
> > model is remembering some of it’s past behavior and data and
> > accordingly throwing 80% True positive rate
>
> It feels like your RandomForestClassifier is not properly tuned. A
> recall of 0.1 on the test set is quite low. It could be worth trying to
> tune it better (cf. https://stackoverflow.com/a/36109706), using a
> metric other than recall to evaluate the performance.
>
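
A minimal sketch of such a search (the grid values are placeholders, and
X_train / y_train stand for the 480 K training rows and their labels):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {'max_depth': [5, 10, None],
                  'min_samples_leaf': [1, 10, 50],
                  'class_weight': [None, 'balanced']}

    # Optimise F1 (or average_precision) rather than recall alone.
    search = GridSearchCV(RandomForestClassifier(n_estimators=100,
                                                 random_state=0),
                          param_grid, scoring='f1', cv=3)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)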
>
> Roman