[scikit-learn] Query Regarding Model Scoring using scikit-learn's joblib library

Debabrata Ghosh mailfordebu at gmail.com
Wed Dec 28 14:25:16 EST 2016


Hi Guillaume,
                          With respect to the following point you mentioned:
You can visualize the trees with sklearn.tree.export_graphviz:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

I couldn't find a direct method for exporting the RandomForestClassifier
trees. Accordingly, I attempted a workaround using the following code, but
still had no success:

clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
clf.fit(p_features_train,p_labels_train)
for i, tree in enumerate(clf.estimators_):
    with open('tree_' + str(i) + '.dot', 'w') as dotfile:
         tree.export_graphviz(clf, dotfile)

Could you please help me with the piece of code I need to execute to
export the RandomForestClassifier trees?
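
(A likely fix, sketched here for reference and untested: export_graphviz is
a module-level function in sklearn.tree, not a method of the individual tree
objects, and it expects a single fitted decision tree rather than the whole
forest.)

from sklearn.tree import export_graphviz

for i, tree in enumerate(clf.estimators_):
    with open('tree_' + str(i) + '.dot', 'w') as dotfile:
        # pass each individual DecisionTreeClassifier, not the forest
        export_graphviz(tree, out_file=dotfile)

Each resulting .dot file can then be rendered with graphviz, e.g.:
dot -Tpng tree_0.dot -o tree_0.png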

Cheers,

Debu


On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:

> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
> wrote:
>
>> Dear Joel, Andrew and Roman,
>>                                                     Thank you very much
>> for your individual feedback! It's very helpful indeed! A few more points
>> related to my model execution:
>>
>> 1. By the term "scoring" I meant the process of executing the model once
>> again without retraining it. So, for training the model I used the
>> RandomForestClassifier class, and for my scoring (execution without
>> retraining) I have used joblib.dump and joblib.load
>>
>
> You should probably use the standard terms: training, validation, and
> testing. "Scoring" is just the value of a metric computed on some data
> (training data, validation data, or testing data).
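>
> For instance, a rough sketch (X and y stand in for your own data):
>
> from sklearn.model_selection import train_test_split
> from sklearn.ensemble import RandomForestClassifier
>
> # hold out 20% of the data as a separate test set
> X_train, X_test, y_train, y_test = train_test_split(
>     X, y, test_size=0.2, random_state=42)
>
> clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
> clf.fit(X_train, y_train)
>
> # "scoring" = computing a metric (here, mean accuracy) on held-out data
> print(clf.score(X_test, y_test))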
>
>
>>
>> 2. I have used the parameter n_estimator = 5000 while training my model.
>> Besides it , I have used n_jobs = -1 and haven't used any other parameter
>>
>
> You should probably check those other parameters and understand
> what their effects are. You should really check the link Roman sent,
> since GridSearchCV can help you decide how to set the parameters:
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
> Additionally, 5000 trees seems like a lot to me.
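>
> A rough sketch of how that might look (the grid values below are only
> illustrative):
>
> from sklearn.model_selection import GridSearchCV
> from sklearn.ensemble import RandomForestClassifier
>
> param_grid = {
>     'n_estimators': [100, 500, 1000],
>     'max_depth': [None, 10, 20],
>     'min_samples_leaf': [1, 5, 10],
> }
> search = GridSearchCV(RandomForestClassifier(n_jobs=-1),
>                       param_grid, scoring='f1', cv=5)
> search.fit(X_train, y_train)
> print(search.best_params_)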
>
>
>>
>> 3. For my "scoring" activity (executing the model without retraining it),
>> is there an alternative to the joblib library?
>>
>
> Joblib only stores data; it has no link with scoring (check Roman's answer).
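>
> Something along these lines (the file name is made up):
>
> from sklearn.externals import joblib
>
> # persist the fitted model (writes model.pkl plus companion .npy files)
> joblib.dump(clf, 'model.pkl')
>
> # later, possibly in another process: reload and predict, no retraining
> clf2 = joblib.load('model.pkl')
> probas = clf2.predict_proba(X_test)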
>
>
>>
>> 4. When I execute my scoring job (via joblib) on a dataset which is
>> completely different from my training dataset, I get a True Positive
>> Rate and a False Positive Rate similar to those from training
>>
>
> That is what you should get.
>
>
>>
>> 5. However, when I execute my scoring job on the same dataset used for
>> training my model, I get a very high TPR and FPR.
>>
>
> You are testing on data which you used during training. Probably
> one of the first rules is not to do that. If you want to evaluate your
> classifier in some way, keep a separate set (the test set) and only test
> on that one. As Roman previously mentioned, 80% of your data is already
> known to the RandomForestClassifier and will be classified almost perfectly.
>
>
>>
>>                                                   Is there a mechanism
>> through which I can visualise the trees created by my RandomForestClassifier?
>> When I dumped the model using joblib.dump, a bunch of .npy files were
>> created. Do those contain the trees?
>>
>
> You can visualize the trees with sklearn.tree.export_graphviz:
> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>
> The bunch of .npy files contain the data needed to load back the
> RandomForestClassifier which you previously dumped.
>
>
>>
>> Thanks in advance !
>>
>> Cheers,
>>
>> Debu
>>
>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
>> wrote:
>>
>>> Your model is overfit to the training data. Not to say that it's
>>> necessarily possible to get a better fit. The default settings for trees
>>> lean towards a tight fit, so you might modify their parameters to increase
>>> regularisation. Still, you should not expect that evaluating a model's
>>> performance on its training data will be indicative of its general
>>> performance. This is why we use held-out test sets and cross-validation.
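>>>
>>> For instance, a hedged sketch (the values below are only illustrative):
>>>
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # shallower trees and larger leaves give stronger regularisation
>>> clf = RandomForestClassifier(n_estimators=100,
>>>                              max_depth=10,
>>>                              min_samples_leaf=20,
>>>                              n_jobs=-1)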
>>>
>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>> wrote:
>>>
>>>> Hi Debu,
>>>>
>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>> >      5. I got a prediction result with True Positive Rate (TPR) of
>>>> >         10-12% on probability thresholds above 0.5
>>>>
>>>> Getting a high True Positive Rate (recall) is not a sufficient condition
>>>> for a well-behaved model, though a recall of 0.1 is still pretty bad. You
>>>> could look at the precision at the same time (or consider, for instance,
>>>> the F1 score).
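>>>>
>>>> For instance (y_test and y_pred stand in for your own labels and
>>>> predictions):
>>>>
>>>> from sklearn.metrics import precision_score, recall_score, f1_score
>>>>
>>>> y_pred = clf.predict(X_test)
>>>> print(precision_score(y_test, y_pred))  # predicted positives that are real
>>>> print(recall_score(y_test, y_pred))     # real positives found (TPR)
>>>> print(f1_score(y_test, y_pred))         # harmonic mean of the two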
>>>>
>>>> >      7. I reloaded the model in a different python instance from the
>>>> >         pickle file mentioned above and did my scoring, i.e., used the
>>>> >         joblib library's load method and then ran prediction
>>>> >         (predict_proba method) on the entire set of my original 600K
>>>> >         records
>>>> >               Another question – is there an alternate model scoring
>>>> >     library (apart from joblib, the one I am using)?
>>>>
>>>> Joblib is not a scoring library; once you load a model from disk with
>>>> joblib you should get essentially the same RandomForestClassifier
>>>> estimator object as before saving it.
>>>>
>>>> >      8. Now when I am running (scoring) my model using
>>>> >         joblib.predict_proba on the entire set of original data (600K),
>>>> >         I am getting a True Positive Rate of around 80%.
>>>>
>>>> That sounds normal, considering what you are doing. Your entire set
>>>> consists of 80% training data (for which the recall, I imagine, would
>>>> be close to 1.0) and 20% test data (with a recall of 0.1), so on average
>>>> you would get a recall close to 0.8 for the complete set
>>>> (0.8 * 1.0 + 0.2 * 0.1 = 0.82). Unless I missed something.
>>>>
>>>>
>>>> >      9. I did some further analysis and figured out that during the
>>>> >         training process, when the model was predicting on the test
>>>> >         sample of 120K it could only predict 10-12% of the 120K data
>>>> >         beyond a probability threshold of 0.5. When I am now trying
>>>> >         to score my model on the entire set of 600K records, it
>>>> >         appears that the model is remembering some of its past
>>>> >         behavior and data and accordingly throwing an 80% True
>>>> >         Positive Rate
>>>>
>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>> recall of 0.1 on the test set is quite low. It could be worth trying to
>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using a
>>>> metric other than recall to evaluate the performance.
>>>>
>>>>
>>>> Roman
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Ile-de-France
> Equipe PARIETAL
> guillaume.lemaitre at inria.fr ---
> https://glemaitre.github.io/
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

