[scikit-learn] Query Regarding Model Scoring using scikit-learn's joblib library

Debabrata Ghosh mailfordebu at gmail.com
Tue Dec 27 12:17:05 EST 2016


Dear Joel, Andrew and Roman,

Thank you very much for your individual feedback! It's very helpful indeed!
A few more points related to my model execution:

1. By the term "scoring" I mean the process of executing the model once
again without retraining it. So, for training the model I used the
RandomForestClassifier class, and for my scoring (execution without
retraining) I have used joblib.dump and joblib.load (see the sketch after
this list).

2. I used the parameter n_estimators = 5000 while training my model.
Besides that, I used n_jobs = -1 and no other parameters.

3. For my "scoring" activity (executing the model without retraining it),
is there an alternative approach to the joblib library?

4. When I execute my scoring job (the joblib approach) on a dataset that is
completely different from my training dataset, I get a True Positive Rate
and False Positive Rate similar to those from training.

5. However, when I execute my scoring job on the same dataset that was used
for training the model, I get a very high TPR and FPR.
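
For reference, here is a minimal sketch of how I am doing the scoring, in
case that clarifies the terminology; the file name and the X_new / y_new
arrays below are placeholders for my actual data, not what I literally ran:

    # Persist a fitted RandomForestClassifier and reuse it later for
    # "scoring", i.e. prediction without retraining. Names are placeholders.
    from sklearn.externals import joblib   # (or "import joblib" directly)
    from sklearn.metrics import confusion_matrix

    # after training:  joblib.dump(clf, 'rf_model.pkl')
    clf = joblib.load('rf_model.pkl')       # same estimator as before saving

    proba = clf.predict_proba(X_new)[:, 1]  # probability of the positive class
    pred = (proba >= 0.5).astype(int)       # apply the 0.5 threshold

    tn, fp, fn, tp = confusion_matrix(y_new, pred).ravel()
    print("TPR:", tp / float(tp + fn), "FPR:", fp / float(fp + tn))

(For point 3, my understanding is that the standard-library pickle module is
the main alternative to joblib, though joblib is usually preferred for
estimators holding large numpy arrays. Please correct me if that is wrong.)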

Is there a mechanism through which I can visualise the trees created by the
RandomForestClassifier algorithm? When I dumped the model using joblib.dump,
a bunch of .npy files were created. Will those contain the trees?
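
If I understand correctly, the individual trees should be accessible on the
loaded estimator as clf.estimators_, and the .npy files are just the numpy
arrays inside the model that joblib.load reassembles. So something like the
rough sketch below ought to export a few trees for inspection (file names
are placeholders; please correct me if this is wrong):

    # Export the first few trees of the fitted forest to Graphviz .dot files.
    from sklearn.tree import export_graphviz

    for i, tree in enumerate(clf.estimators_[:3]):   # just the first 3 trees
        export_graphviz(tree, out_file='tree_%d.dot' % i,
                        filled=True, rounded=True)
    # render from the command line, e.g.:  dot -Tpng tree_0.dot -o tree_0.png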

Thanks in advance!

Cheers,

Debu

On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
wrote:

> Your model is overfit to the training data. Not to say that it's
> necessarily possible to get a better fit. The default settings for trees
> lean towards a tight fit, so you might modify their parameters to increase
> regularisation. Still, you should not expect that evaluating a model's
> performance on its training data will be indicative of its general
> performance. This is why we use held-out test sets and cross-validation.
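
Understood. Something like the sketch below is what I will try, i.e. hold
out a test set, add some regularisation to the trees, and cross-validate on
the training part only (X, y and all parameter values are placeholders and
examples, not what I have actually run):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split, cross_val_score

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    clf = RandomForestClassifier(n_estimators=500,
                                 max_depth=10,         # shallower trees
                                 min_samples_leaf=20,  # larger leaves
                                 n_jobs=-1)
    print(cross_val_score(clf, X_train, y_train, cv=5, scoring='f1'))

    clf.fit(X_train, y_train)
    # report final numbers on the held-out test set only
    print(clf.score(X_test, y_test))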
>
> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com> wrote:
>
>> Hi Debu,
>>
>> On 27/12/16 08:18, Andrew Howe wrote:
>> >      5. I got a prediction result with True Positive Rate (TPR) as 10-12
>> >         % on probability thresholds above 0.5
>>
>> Getting a high True Positive Rate (recall) is not a sufficient condition
>> for a well behaved model. Though 0.1 recall is still pretty bad. You
>> could look at the precision at the same time (or consider, for instance,
>> the F1 score).
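
Noted, I will report precision and F1 alongside recall, roughly as in the
sketch below (y_test and proba are placeholders for my held-out labels and
predicted probabilities):

    from sklearn.metrics import precision_recall_fscore_support

    pred = (proba >= 0.5).astype(int)    # same 0.5 threshold as before
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, pred, average='binary')
    print(precision, recall, f1)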
>>
>> >      7. I reloaded the model in a different Python instance from the
>> >         pickle file mentioned above and did my scoring, i.e., used the
>> >         joblib load method and then ran prediction (predict_proba) on
>> >         the entire set of my original 600K records
>> >               Another question – is there an alternative model scoring
>> >     library (apart from joblib, the one I am using)?
>>
>> Joblib is not a scoring library; once you load a model from disk with
>> joblib you should get ~ the same RandomForestClassifier estimator object
>> as before saving it.
>>
>> >      8. Now when I am running (scoring) my model using
>> >         joblib.predict_proba on the entire set of original data (600 K),
>> >         I am getting a True Positive rate of around 80%.
>>
>> That sounds normal, considering what you are doing. Your entire set
>> consists of 80% training data (for which the recall, I imagine, would
>> be close to 1.0) and 20% test data (with a recall of 0.1), so on
>> average you would get a recall close to 0.8 for the complete set
>> (0.8 * 1.0 + 0.2 * 0.1 = 0.82). Unless I missed something.
>>
>>
>> >      9. I did some further analysis and figured out that during the
>> >         training process, when the model was predicting on the test
>> >         sample of 120K it could only predict 10-12% of the 120K records
>> >         beyond a probability threshold of 0.5. When I now score my
>> >         model on the entire set of 600K records, it appears that the
>> >         model is remembering some of its past behavior and data and
>> >         accordingly producing an 80% True Positive Rate
>>
>> It feels like your RandomForestClassifier is not properly tuned. A
>> recall of 0.1 on the test set is quite low. It could be worth trying to
>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>> other metric than the recall to evaluate the performance.
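
For the tuning, I am thinking of a grid search over a few tree parameters,
scored with F1 rather than recall, roughly as in the sketch below (the grid
values are only examples, not recommendations):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {'max_depth': [5, 10, 20, None],
                  'min_samples_leaf': [1, 10, 50],
                  'max_features': ['sqrt', 0.5]}

    search = GridSearchCV(RandomForestClassifier(n_estimators=200, n_jobs=-1),
                          param_grid, scoring='f1', cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)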
>>
>>
>> Roman