[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
mailfordebu at gmail.com
Tue Dec 27 13:38:29 EST 2016
Thanks, Guillaume, for your quick feedback! I appreciate it a lot.
I will definitely try out the links you have given. Another quick question,
please. My objective is to execute the model without retraining it. Let me
give you an example to elaborate: I train my model on a huge set
of data (six months' worth of historic data) and finalise it. Going
forward, I need to run the model against smaller sets of data (daily data),
and for that I shouldn't need to retrain it daily.
Given the above scenario, I wanted to confirm once more whether this is a
good approach: after training the model I call joblib.dump, and then, when
executing the model on a daily basis, I call joblib.load. I am
using joblib.dump(clf, 'model.pkl') for saving and
joblib.load('model.pkl') for loading. I am not directly using any of the
*.npy files generated in the folder.
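For concreteness, here is a minimal sketch of the workflow I am describing. It is not my real job: the synthetic data from make_classification, the small n_estimators value, and the temporary file path are placeholders chosen for illustration, and it assumes the standalone joblib package is installed (older scikit-learn versions exposed it as sklearn.externals.joblib).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the historic (six-month) training data.
X_hist, y_hist = make_classification(n_samples=500, n_features=10, random_state=0)

# Train once. The tree count is kept small here only to keep the sketch fast.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_hist, y_hist)

# Persist the fitted estimator (placeholder path in a temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, model_path)

# Later, e.g. in the daily job: load the fitted estimator and predict
# on fresh data, with no call to fit().
loaded = joblib.load(model_path)
X_daily, _ = make_classification(n_samples=50, n_features=10, random_state=1)
daily_proba = loaded.predict_proba(X_daily)

# The loaded model should give the same probabilities as the original one.
same_as_original = np.allclose(daily_proba, clf.predict_proba(X_daily))
```

In a real setup the dump and the load would live in separate scripts, and the scikit-learn version used for loading should match the one used for dumping.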
Now, you mentioned that joblib is a mechanism to save data, but my
objective is not to load the data used during model training, only
the algorithm (the trained model), so that I can run it on a fresh set of
data after loading. Indeed, my model runs fine after I execute the
joblib.load('model.pkl') command, but I wanted to confirm what it is doing.
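One way I could check this myself is to inspect the object that joblib.load returns. The sketch below again uses synthetic placeholder data and a temporary path (both assumptions, not my real setup). It looks at the fitted trees stored on the loaded estimator and does a shallow check that no attribute has the shape of the training matrix.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data and a small forest, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, path)
loaded = joblib.load(path)

# The loaded object is the fitted estimator: it carries the trained trees.
n_trees = len(loaded.estimators_)
first_tree_nodes = loaded.estimators_[0].tree_.node_count

# Shallow check only: no top-level attribute has the shape of the
# training matrix, i.e. the training rows are not stored on the forest.
has_training_data = any(
    getattr(value, "shape", None) == X.shape for value in vars(loaded).values()
)
```

If this reading is right, the pickle (and the accompanying .npy files) hold the learned tree structures and parameters, not the historic data itself.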
Thanks in advance!
On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:
> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
>> Dear Joel, Andrew and Roman,
>> Thank you very much for your individual feedback! It's very helpful
>> indeed! A few more points related to my model execution:
>> 1. By the term "scoring" I meant the process of executing the model once
>> again without retraining it. So, for training the model I used
>> RandomForestClassifier, and for my scoring (execution without
>> retraining) I have used joblib.dump and joblib.load.
> It is probably best to use the terms training, validating, and testing.
> These are pretty much standard. A score is just the value of a
> metric given some data (training data, validation data, or
> testing data).
>> 2. I have used the parameter n_estimators = 5000 while training my model.
>> Besides that, I have used n_jobs = -1 and haven't set any other parameters.
> You should probably check those other parameters and understand
> what their effects are. You should really check Roman's link,
> since GridSearchCV can help you decide how to set the parameters.
> Additionally, 5000 trees seems like a lot to me.
>> 3. For my "scoring" activity (executing the model without retraining it)
>> is there an alternative to the joblib library?
> Joblib only stores data. There is no link with scoring (check Roman's answer).
>> 4. When I execute my scoring job (joblib method) on a dataset which is
>> completely different from my training dataset, I get a True
>> Positive Rate and False Positive Rate similar to those seen during training.
> It is what you should get.
>> 5. However, when I execute my scoring job on the same dataset used for
>> training my model then I get very high TPR and FPR.
> You are testing on data that you used for training. Probably
> one of the first rules is not to do that. If you want to evaluate
> your classifier, keep a separate set (test set) and only test on that
> one. As previously mentioned by Roman, 80% of your data is already
> known by the RandomForestClassifier and will be perfectly classified.
>> Is there a mechanism
>> through which I can visualise the trees created by my RandomForestClassifier
>> algorithm? When I dumped the model using joblib.dump, a bunch
>> of .npy files were created. Will those contain the trees?
> You can visualize the trees with sklearn.tree.export_graphviz.
> The bunch of .npy files is the data needed to load the RandomForestClassifier
> you previously dumped.
>> Thanks in advance !
>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
>>> Your model is overfit to the training data. Not to say that it's
>>> necessarily possible to get a better fit. The default settings for trees
>>> lean towards a tight fit, so you might modify their parameters to increase
>>> regularisation. Still, you should not expect that evaluating a model's
>>> performance on its training data will be indicative of its general
>>> performance. This is why we use held-out test sets and cross-validation.
>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>>> Hi Debu,
>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>> > 5. I got a prediction result with True Positive Rate (TPR) as
>>>> > % on probability thresholds above 0.5
>>>> Getting a high True Positive Rate (recall) is not a sufficient condition
>>>> for a well-behaved model, though a recall of 0.1 is still pretty bad. You
>>>> could look at the precision at the same time (or consider, for instance,
>>>> the F1 score).
>>>> > 7. I reloaded the model in a different python instance from the
>>>> > pickle file mentioned above and did my scoring , i.e., used
>>>> > joblib library load method and then instantiated prediction
>>>> > (predict_proba method) on the entire set of my original 600 K
>>>> > records
>>>> > Another question – is there an alternate model scoring
>>>> > library (apart from joblib, the one I am using) ?
>>>> Joblib is not a scoring library; once you load a model from disk with
>>>> joblib you should get ~ the same RandomForestClassifier estimator object
>>>> as before saving it.
>>>> > 8. Now when I am running (scoring) my model using
>>>> > joblib.predict_proba on the entire set of original data (600
>>>> > I am getting a True Positive rate of around 80%.
>>>> That sounds normal, considering what you are doing. Your entire set
>>>> consists of 80% of training set (for which the recall, I imagine, would
>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on
>>>> average you would get a recall close to 0.8 for the complete set. Unless
>>>> I missed something.
>>>> > 9. I did some further analysis and figured out that during the
>>>> > training process, when the model was predicting on the test
>>>> > sample of 120K it could only predict 10-12% of 120K data
>>>> > a probability threshold of 0.5. When I am now trying to score
>>>> > model on the entire set of 600 K records, it appears that the
>>>> > model is remembering some of its past behavior and data and
>>>> > accordingly throwing 80% True positive rate
>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>> recall of 0.1 on the test set is quite low. It could be worth trying to
>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>>>> other metric than the recall to evaluate the performance.
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
> Guillaume Lemaitre
> INRIA Saclay - Ile-de-France
> Equipe PARIETAL
> guillaume.lemaitre at inria.fr