[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

Guillaume Lemaître g.lemaitre58 at gmail.com
Tue Dec 27 12:48:29 EST 2016


On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com> wrote:

> Dear Joel, Andrew and Roman,
>                                                     Thank you very much
> for your individual feedback ! It's very helpful indeed ! A few more points
> related to my model execution:
>
> 1. By the term "scoring" I meant the process of executing the model once
> again without retraining it. So , for training the model I used
> RandomForestClassifer library and for my scoring (execution without
> retraining) I have used joblib.dump and joblib.load
>

You should probably go with the standard terms: training, validating, and
testing. "Scoring" is just the value of a metric computed on some data
(training data, validation data, or test data).
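
For instance (a quick sketch with toy data, just to illustrate the
terminology; the dataset and settings are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # toy data for illustration only
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)            # training
    print(clf.score(X_train, y_train))   # "score" (accuracy) on training data
    print(clf.score(X_test, y_test))     # "score" (accuracy) on held-out test data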


>
> 2. I have used the parameter n_estimator = 5000 while training my model.
> Besides it , I have used n_jobs = -1 and haven't used any other parameter
>

You should probably check those other parameters and understand what their
effects are. You should also really check the link Roman posted, since
GridSearchCV can help you decide how to set the parameters:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
Additionally, 5000 trees seems like a lot to me.
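
For example, something along these lines (the grid values below are only an
illustration, not a recommendation):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)

    # candidate values to try; adapt them to your own data
    param_grid = {
        'n_estimators': [100, 500, 1000],
        'max_depth': [None, 5, 10],
        'min_samples_leaf': [1, 5, 10],
    }
    search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0),
                          param_grid, scoring='f1', cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)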


>
> 3. For my "scoring" activity (executing the model without retraining it)
> is there an alternate approach to joblib library ?
>

Joblib only stores data; it has no link with scoring (check Roman's answer).
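
Roughly, the workflow looks like this (a sketch only; the file name is
arbitrary, and the data is a toy example):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.externals import joblib   # in newer versions: import joblib

    X, y = make_classification(n_samples=1000, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)

    joblib.dump(clf, 'model.pkl')           # persist the fitted estimator
    clf_loaded = joblib.load('model.pkl')   # reload it later, no retraining
    proba = clf_loaded.predict_proba(X)     # the estimator does the predicting;
                                            # joblib itself does no scoring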


>
> 4. When I execute my scoring job (joblib method) on a dataset , which is
> completely different to my training dataset then I get similar True
> Positive Rate and False Positive Rate as of training
>

That is what you should expect to get.


>
> 5. However, when I execute my scoring job on the same dataset used for
> training my model then I get very high TPR and FPR.
>

You are testing on data that you used during training; one of the first
rules is not to do that. If you want to evaluate your classifier, keep a
separate test set and evaluate only on it. As Roman previously mentioned,
80% of your data is already known by the RandomForestClassifier and will be
perfectly classified, which is enough to explain the numbers you see (see
the quick check below).
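
A quick back-of-envelope check (assuming recall of ~1.0 on the memorised
training rows and ~0.1 on the unseen test rows, the numbers from Roman's
answer):

    # 80% training rows + 20% test rows
    mixed_recall = 0.8 * 1.0 + 0.2 * 0.1
    print(mixed_recall)   # ~0.82, close to the ~80% TPR you observed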


>
>                                                   Is there mechanism
> through which I can visualise the trees created by my RandomForestClassifer
> algorithm ? While I dumped the model using joblib.dump , there are a bunch
> of .npy files created. Will those contain the trees ?
>

You can visualize the trees with sklearn.tree.export_graphviz:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
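
Each fitted tree lives in the estimators_ attribute of the forest, so you can
export any of them, for instance (a sketch with toy data; file names are
arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_graphviz

    X, y = make_classification(n_samples=1000, random_state=0)
    forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

    # export the first tree of the forest to a .dot file
    export_graphviz(forest.estimators_[0], out_file='tree_0.dot')
    # then render it with graphviz:  dot -Tpng tree_0.dot -o tree_0.png

Keep in mind that with n_estimators=5000 you would have 5000 such trees.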

The bunch of .npy files contain the data needed to reload the
RandomForestClassifier that you previously dumped.


>
> Thanks in advance !
>
> Cheers,
>
> Debu
>
> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> Your model is overfit to the training data. Not to say that it's
>> necessarily possible to get a better fit. The default settings for trees
>> lean towards a tight fit, so you might modify their parameters to increase
>> regularisation. Still, you should not expect that evaluating a model's
>> performance on its training data will be indicative of its general
>> performance. This is why we use held-out test sets and cross-validation.
>>
>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>> wrote:
>>
>>> Hi Debu,
>>>
>>> On 27/12/16 08:18, Andrew Howe wrote:
>>> >      5. I got a prediction result with True Positive Rate (TPR) as
>>> 10-12
>>> >         % on probability thresholds above 0.5
>>>
>>> Getting a high True Positive Rate (recall) is not a sufficient condition
>>> for a well behaved model. Though 0.1 recall is still pretty bad. You
>>> could look at the precision at the same time (or consider, for instance,
>>> the F1 score).
>>>
>>> >      7. I reloaded the model in a different python instance from the
>>> >         pickle file mentioned above and did my scoring , i.e., used
>>> >         joblib library load method and then instantiated prediction
>>> >         (predict_proba method) on the entire set of my original 600 K
>>> >         records
>>> >               Another question – is there an alternate model scoring
>>> >     library (apart from joblib, the one I am using) ?
>>>
>>> Joblib is not a scoring library; once you load a model from disk with
>>> joblib you should get ~ the same RandomForestClassifier estimator object
>>> as before saving it.
>>>
>>> >      8. Now when I am running (scoring) my model using
>>> >         joblib.predict_proba on the entire set of original data (600
>>> K),
>>> >         I am getting a True Positive rate of around 80%.
>>>
>>> That sounds normal, considering what you are doing. Your entire set
>>> consists of 80% of training set (for which the recall, I imagine, would
>>> be close to 1.0) and 20 %  test set (with a recall of 0.1), so on
>>> average you would get a recall close to 0.8 for the complete set. Unless
>>> I missed something.
>>>
>>>
>>> >      9. I did some  further analysis and figured out that during the
>>> >         training process, when the model was predicting on the test
>>> >         sample of 120K it could only predict 10-12% of 120K data beyond
>>> >         a probability threshold of 0.5. When I am now trying to score
>>> my
>>> >         model on the entire set of 600 K records, it appears that the
>>> >         model is remembering some of it’s past behavior and data and
>>> >         accordingly throwing 80% True positive rate
>>>
>>> It feels like your RandomForestClassifier is not properly tuned. A
>>> recall of 0.1 on the test set is quite low. It could be worth trying to
>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>>> other metric than the recall to evaluate the performance.
>>>
>>>
>>> Roman


-- 
Guillaume Lemaitre
INRIA Saclay - Ile-de-France
Equipe PARIETAL
guillaume.lemaitre at inria.fr
https://glemaitre.github.io/