[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

Guillaume Lemaître g.lemaitre58 at gmail.com
Tue Dec 27 14:12:39 EST 2016


On 27 December 2016 at 19:38, Debabrata Ghosh <mailfordebu at gmail.com> wrote:

> Thanks Guillaume for your quick feedback ! Appreciate it a lot.
>
> I will definitely try out the links you have given. Another quick one
> please. My objective is to execute the model without retraining it. Let me
> give you an example to elaborate this: I train my model on a huge set
> of data (six months of historical data) and finalise my model. Going
> forward I need to run my model against a smaller set of data (daily data),
> and for that I wouldn't need to retrain my model daily.
>

So you just need to dump the model after training (which is actually what you
did).


>
> Given the above scenario, I wanted to confirm once more whether it is a good
> approach to use joblib.dump after training the model and then joblib.load
> when executing the model on a daily basis. I am using
> joblib.dump(clf, 'model.pkl') and, for loading, I am using
> joblib.load('model.pkl'). I am not leveraging any of the *.npy files
> generated in the folder.
>

So, you need to train and dump the estimator. To predict with the dumped
model, you need to load it and call predict/predict_proba, etc.
The .npy files are the files associated with your model. In the case of a
random forest, the parameters of each tree have to be stored. With 5000 trees,
you should therefore end up with many .npy files. The training data themselves
are not dumped.
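
For illustration, a minimal sketch of that workflow (the toy data, clf and
model.pkl are just placeholders for your own data and file names):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.externals import joblib

    # toy data standing in for the historical training records
    X_train, y_train = make_classification(n_samples=1000, random_state=0)

    # training phase, done once
    clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
    clf.fit(X_train, y_train)
    joblib.dump(clf, 'model.pkl')

    # daily "scoring" phase: no retraining, just load and predict
    clf = joblib.load('model.pkl')
    proba = clf.predict_proba(X_train[:10])   # here the new daily data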


>
> Now, you mentioned that joblib is a mechanism to save the data, but my
> objective is not to load the data used during model training, only the
> algorithm, so that I can run the model on a fresh set of data after loading
> it. And indeed my model is running fine after I execute the
> joblib.load('model.pkl') command, but I wanted to confirm what it's doing
> internally.
>
> Thanks in advance !
>
> Cheers,
>
> Debu
>
> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
> g.lemaitre58 at gmail.com> wrote:
>
>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Dear Joel, Andrew and Roman,
>>>                                                     Thank you very much
>>> for your individual feedback ! It's very helpful indeed ! A few more points
>>> related to my model execution:
>>>
>>> 1. By the term "scoring" I meant the process of executing the model once
>>> again without retraining it. So, for training the model I used the
>>> RandomForestClassifier class, and for my scoring (execution without
>>> retraining) I have used joblib.dump and joblib.load
>>>
>>
>> Probably go with the terms: training, validating, and testing.
>> This is pretty much standard. Scoring is just the value of a
>> metric given some data (training data, validation data, or
>> testing data).
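>>
>> For instance, a minimal sketch of "scoring" in that sense (accuracy is just
>> one possible metric; clf, X_test and y_test stand for your fitted model and
>> a held-out test set):
>>
>>     from sklearn.metrics import accuracy_score
>>
>>     # scoring = computing the value of a metric on some data
>>     y_pred = clf.predict(X_test)
>>     print(accuracy_score(y_test, y_pred))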
>>
>>
>>>
>>> 2. I have used the parameter n_estimators = 5000 while training my model.
>>> Besides that, I have used n_jobs = -1 and haven't used any other parameters.
>>>
>>
>> You should probably check those other parameters and understand what their
>> effects are. You should really check the link Roman sent, since GridSearchCV
>> can help you decide how to set the parameters:
>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>> Additionally, 5000 trees seems like a lot to me.
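>>
>> Something along these lines (a rough sketch only; the parameter grid and the
>> scoring choice are illustrative, not recommendations):
>>
>>     from sklearn.ensemble import RandomForestClassifier
>>     from sklearn.model_selection import GridSearchCV
>>
>>     param_grid = {
>>         'n_estimators': [100, 300, 500],
>>         'max_depth': [None, 10, 20],
>>         'min_samples_leaf': [1, 5, 10],
>>     }
>>     search = GridSearchCV(RandomForestClassifier(n_jobs=-1),
>>                           param_grid, scoring='f1', cv=5)
>>     search.fit(X_train, y_train)
>>     print(search.best_params_)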
>>
>>
>>>
>>> 3. For my "scoring" activity (executing the model without retraining it),
>>> is there an alternative approach to the joblib library?
>>>
>>
>> Joblib only stores data. There is no link with scoring (check Roman's
>> answer).
>>
>>
>>>
>>> 4. When I execute my scoring job (joblib method) on a dataset which is
>>> completely different from my training dataset, I get a True Positive Rate
>>> and False Positive Rate similar to those from training.
>>>
>>
>> That is what you should expect.
>>
>>
>>>
>>> 5. However, when I execute my scoring job on the same dataset used for
>>> training my model, I get a very high TPR and FPR.
>>>
>>
>> You are testing on data which you used during training. One of the first
>> rules is to not do that. If you want to evaluate your classifier, keep a
>> separate set (test set) and only evaluate on that one, as in the sketch
>> below. As Roman previously mentioned, 80% of your data is already known
>> by the RandomForestClassifier and will be perfectly classified.
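>>
>> A minimal sketch of that split (assuming a feature matrix X and labels y):
>>
>>     from sklearn.model_selection import train_test_split
>>
>>     # hold out 20% of the data; the model never sees it during fit
>>     X_train, X_test, y_train, y_test = train_test_split(
>>         X, y, test_size=0.2, stratify=y, random_state=0)
>>     clf.fit(X_train, y_train)
>>     # evaluate only on the held-out part
>>     print(clf.score(X_test, y_test))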
>>
>>
>>>
>>>                                                   Is there a mechanism
>>> through which I can visualise the trees created by my
>>> RandomForestClassifier algorithm? When I dumped the model using
>>> joblib.dump, a bunch of .npy files were created. Will those contain the
>>> trees?
>>>
>>
>> You can visualize the trees with sklearn.tree.export_graphviz:
>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>
>> The bunch of .npy files are the data needed to load the
>> RandomForestClassifier which you previously dumped.
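>>
>> For instance, a minimal sketch that exports a single tree of the fitted
>> forest to a .dot file that graphviz can render (file names are placeholders):
>>
>>     from sklearn.tree import export_graphviz
>>
>>     # clf is the fitted RandomForestClassifier; look at its first tree
>>     export_graphviz(clf.estimators_[0], out_file='tree_0.dot')
>>     # then render it, e.g.: dot -Tpng tree_0.dot -o tree_0.png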
>>
>>
>>>
>>> Thanks in advance !
>>>
>>> Cheers,
>>>
>>> Debu
>>>
>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
>>> wrote:
>>>
>>>> Your model is overfit to the training data. That's not to say it's
>>>> necessarily possible to get a better fit. The default settings for trees
>>>> lean towards a tight fit, so you might modify their parameters to increase
>>>> regularisation. Still, you should not expect that evaluating a model's
>>>> performance on its training data will be indicative of its general
>>>> performance. This is why we use held-out test sets and cross-validation.
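>>>>
>>>> For example, a sketch of adding such regularisation (the values are
>>>> purely illustrative, not tuned):
>>>>
>>>>     from sklearn.ensemble import RandomForestClassifier
>>>>
>>>>     # limiting depth and leaf size regularises the individual trees
>>>>     clf = RandomForestClassifier(n_estimators=300,
>>>>                                  max_depth=10,
>>>>                                  min_samples_leaf=20,
>>>>                                  n_jobs=-1)
>>>>     clf.fit(X_train, y_train)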
>>>>
>>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Debu,
>>>>>
>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>> >      5. I got a prediction result with a True Positive Rate (TPR) of
>>>>> >         10-12% on probability thresholds above 0.5
>>>>>
>>>>> Getting a high True Positive Rate (recall) is not a sufficient condition
>>>>> for a well-behaved model. That said, 0.1 recall is still pretty bad. You
>>>>> could look at the precision at the same time (or consider, for instance,
>>>>> the F1 score).
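>>>>>
>>>>> For instance, a minimal sketch (assuming binary labels y_test and the
>>>>> predictions of the loaded model clf):
>>>>>
>>>>>     from sklearn.metrics import classification_report
>>>>>
>>>>>     # reports precision, recall and F1 per class
>>>>>     y_pred = clf.predict(X_test)
>>>>>     print(classification_report(y_test, y_pred))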
>>>>>
>>>>> >      7. I reloaded the model in a different python instance from the
>>>>> >         pickle file mentioned above and did my scoring , i.e., used
>>>>> >         joblib library load method and then instantiated prediction
>>>>> >         (predict_proba method) on the entire set of my original 600 K
>>>>> >         records
>>>>> >               Another question – is there an alternate model scoring
>>>>> >     library (apart from joblib, the one I am using) ?
>>>>>
>>>>> Joblib is not a scoring library; once you load a model from disk with
>>>>> joblib you should get ~ the same RandomForestClassifier estimator
>>>>> object
>>>>> as before saving it.
>>>>>
>>>>> >      8. Now when I am running (scoring) my model using
>>>>> >         joblib.predict_proba on the entire set of original data
>>>>> >         (600 K), I am getting a True Positive Rate of around 80%.
>>>>>
>>>>> That sounds normal, considering what you are doing. Your entire set
>>>>> consists of 80% training data (for which the recall, I imagine, would
>>>>> be close to 1.0) and 20% test data (with a recall of 0.1), so on
>>>>> average you would get a recall close to 0.8 for the complete set.
>>>>> Unless I missed something.
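>>>>>
>>>>> A rough back-of-the-envelope check of that figure:
>>>>>
>>>>>     0.8 * 1.0 + 0.2 * 0.1 = 0.82   # roughly the ~80% TPR observed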
>>>>>
>>>>>
>>>>> >      9. I did some further analysis and figured out that during the
>>>>> >         training process, when the model was predicting on the test
>>>>> >         sample of 120K it could only predict 10-12% of the 120K data
>>>>> >         beyond a probability threshold of 0.5. When I am now trying
>>>>> >         to score my model on the entire set of 600 K records, it
>>>>> >         appears that the model is remembering some of its past
>>>>> >         behavior and data and accordingly throwing an 80% True
>>>>> >         Positive Rate
>>>>>
>>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>>> recall of 0.1 on the test set is quite low. It could be worth trying to
>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>>>>> metric other than recall to evaluate the performance.
>>>>>
>>>>>
>>>>> Roman
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Ile-de-France
>> Equipe PARIETAL
>> guillaume.lemaitre at inria.fr ---
>> https://glemaitre.github.io/
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 
Guillaume Lemaitre
INRIA Saclay - Ile-de-France
Equipe PARIETAL
guillaume.lemaitre at inria.fr ---
https://glemaitre.github.io/

