[scikit-learn] Query Regarding Model Scoring using scikit-learn's joblib library

Naoya Kanai naopon at gmail.com
Wed Dec 28 23:50:36 EST 2016


The ‘tree’ name is clashing between the sklearn.tree module and the
DecisionTreeClassifier objects in the loop: the loop variable shadows the
module, so tree.export_graphviz becomes an attribute lookup on a fitted
tree, which fails.

You can change the import to

from sklearn.tree import export_graphviz

and modify the method call accordingly.
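
For example, something along these lines should work (a minimal sketch:
the loop variable is renamed so it no longer shadows the sklearn.tree
module, and the output file names are just an illustration):

from sklearn.tree import export_graphviz

for idx_tree, estimator in enumerate(clf.estimators_):
    # each element of estimators_ is a fitted DecisionTreeClassifier
    export_graphviz(estimator, out_file='tree_{}.dot'.format(idx_tree))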

On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh <mailfordebu at gmail.com>
wrote:

> Hi Guillaume,
>                                       Thanks for your feedback! I am
> still getting an error while attempting to print the trees. Here is a
> snapshot of my code. I know I may be missing something very silly, but
> still wanted to check and see how this works.
>
> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
> >>> clf.fit(p_features_train,p_labels_train)
> RandomForestClassifier(bootstrap=True, class_weight=None,
> criterion='gini',
>             max_depth=None, max_features='auto', max_leaf_nodes=None,
>             min_samples_leaf=1, min_samples_split=2,
>             min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
>             oob_score=False, random_state=None, verbose=0,
>             warm_start=False)
> >>> for idx_tree, tree in enumerate(clf.estimators_):
> ...     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> NameError: name 'export_graphviz' is not defined
> >>> for idx_tree, tree in enumerate(clf.estimators_):
> ...     tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> AttributeError: 'DecisionTreeClassifier' object has no attribute
> 'export_graphviz'
>
> Just to give you some background, I have imported the following
> libraries:
>
> from sklearn.ensemble import RandomForestClassifier
> from sklearn import tree
>
> Thanks again as always!
>
> Cheers,
>
> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <
> g.lemaitre58 at gmail.com> wrote:
>
>> After the fit, you need this call:
>> for idx_tree, tree in enumerate(clf.estimators_):
>>      export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>
>>
>>
>> On 28 December 2016 at 20:25, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Hi Guillaume,
>>>                           With respect to the following point you
>>> mentioned:
>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>
>>> I couldn't find a direct method for exporting the RandomForestClassifier
>>> trees. Accordingly, I attempted a workaround using the following code,
>>> but still no success:
>>>
>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>> clf.fit(p_features_train,p_labels_train)
>>> for i, tree in enumerate(clf.estimators_):
>>>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>>          tree.export_graphviz(clf, dotfile)
>>>
>>> Would you please be able to help me with the piece of code I need to
>>> execute to export the RandomForestClassifier trees?
>>>
>>> Cheers,
>>>
>>> Debu
>>>
>>>
>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
>>> g.lemaitre58 at gmail.com> wrote:
>>>
>>>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
>>>> wrote:
>>>>
>>>>> Dear Joel, Andrew and Roman,
>>>>>                                                     Thank you very
>>>>> much for your individual feedback! It's very helpful indeed! A few more
>>>>> points related to my model execution:
>>>>>
>>>>> 1. By the term "scoring" I meant the process of executing the model
>>>>> once again without retraining it. So, for training the model I used the
>>>>> RandomForestClassifier class, and for my scoring (execution without
>>>>> retraining) I have used joblib.dump and joblib.load
>>>>>
>>>>
>>>> Probably go with the terms: training, validating, and testing.
>>>> This is pretty much standard. Scoring is just the value of a
>>>> metric given some data (training data, validation data, or
>>>> testing data).
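>>>>
>>>> For example, a minimal sketch (here X_test and y_test stand for
>>>> held-out features and labels, not names from your code):
>>>>
>>>> from sklearn.metrics import recall_score
>>>>
>>>> y_pred = clf.predict(X_test)         # predictions on held-out data
>>>> print(recall_score(y_test, y_pred))  # the "score" is just this number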
>>>>
>>>>
>>>>>
>>>>> 2. I have used the parameter n_estimators=5000 while training my
>>>>> model. Besides that, I have used n_jobs=-1 and haven't used any other
>>>>> parameters
>>>>>
>>>>
>>>> You should probably check those other parameters and understand what
>>>> their effects are. You should really check the link from Roman,
>>>> since GridSearchCV can help you decide how to set the parameters:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>> Additionally, 5000 trees seems a lot to me.
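>>>>
>>>> A rough sketch of what that could look like (the grid values are only
>>>> illustrative, not a recommendation):
>>>>
>>>> from sklearn.ensemble import RandomForestClassifier
>>>> from sklearn.model_selection import GridSearchCV
>>>>
>>>> param_grid = {'n_estimators': [100, 500, 1000],
>>>>               'max_depth': [5, 10, None]}
>>>> search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid)
>>>> search.fit(p_features_train, p_labels_train)
>>>> print(search.best_params_)  # best combination found by cross-validation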
>>>>
>>>>
>>>>>
>>>>> 3. For my "scoring" activity (executing the model without retraining
>>>>> it), is there an alternative to the joblib library?
>>>>>
>>>>
>>>> Joblib only stores data. It has no link with scoring (check Roman's
>>>> answer).
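>>>>
>>>> To be concrete, a minimal sketch of dumping and reloading (the plain
>>>> pickle module would work the same way):
>>>>
>>>> from sklearn.externals import joblib
>>>>
>>>> joblib.dump(clf, 'model.pkl')    # persist the fitted estimator
>>>> clf2 = joblib.load('model.pkl')  # reload; clf2 behaves like clf
>>>> proba = clf2.predict_proba(p_features_train)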
>>>>
>>>>
>>>>>
>>>>> 4. When I execute my scoring job (joblib method) on a dataset which is
>>>>> completely different from my training dataset, I get a True Positive
>>>>> Rate and False Positive Rate similar to those from training
>>>>>
>>>>
>>>> That is what you should get.
>>>>
>>>>
>>>>>
>>>>> 5. However, when I execute my scoring job on the same dataset used for
>>>>> training my model, I get very high TPR and FPR.
>>>>>
>>>>
>>>> You are testing on some data which you used while training. Probably
>>>> one of the first rules is to not do that. If you want to evaluate your
>>>> classifier in some way, have a separate set (test set) and only test on
>>>> that one. As previously mentioned by Roman, 80% of your data are already
>>>> known by the RandomForestClassifier and will be perfectly classified.
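>>>>
>>>> For instance, a minimal sketch of a held-out split (p_features and
>>>> p_labels are assumed names for your full feature matrix and labels):
>>>>
>>>> from sklearn.model_selection import train_test_split
>>>>
>>>> X_train, X_test, y_train, y_test = train_test_split(
>>>>     p_features, p_labels, test_size=0.2)
>>>> clf.fit(X_train, y_train)
>>>> # evaluate only on data the model has never seen
>>>> print(clf.score(X_test, y_test))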
>>>>
>>>>
>>>>>
>>>>>                                                   Is there a mechanism
>>>>> through which I can visualise the trees created by my
>>>>> RandomForestClassifier algorithm? While I dumped the model using
>>>>> joblib.dump, there are a bunch of .npy files created. Will those
>>>>> contain the trees?
>>>>>
>>>>
>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>
>>>> The bunch of .npy files are the data needed to load the
>>>> RandomForestClassifier which you previously dumped.
>>>>
>>>>
>>>>>
>>>>> Thanks in advance !
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Debu
>>>>>
>>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Your model is overfit to the training data. That's not to say it's
>>>>>> necessarily possible to get a better fit. The default settings for trees
>>>>>> lean towards a tight fit, so you might modify their parameters to increase
>>>>>> regularisation. Still, you should not expect that evaluating a model's
>>>>>> performance on its training data will be indicative of its general
>>>>>> performance. This is why we use held-out test sets and cross-validation.
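>>>>>>
>>>>>> For example, settings along these lines increase regularisation (the
>>>>>> values are purely illustrative):
>>>>>>
>>>>>> from sklearn.ensemble import RandomForestClassifier
>>>>>>
>>>>>> clf = RandomForestClassifier(n_estimators=500,
>>>>>>                              max_depth=10,        # cap tree depth
>>>>>>                              min_samples_leaf=5)  # require larger leaves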
>>>>>>
>>>>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Debu,
>>>>>>>
>>>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>>>> >      5. I got a prediction result with True Positive Rate (TPR) of
>>>>>>> >         10-12% on probability thresholds above 0.5
>>>>>>>
>>>>>>> Getting a high True Positive Rate (recall) is not a sufficient
>>>>>>> condition for a well-behaved model, though 0.1 recall is still pretty
>>>>>>> bad. You could look at the precision at the same time (or consider,
>>>>>>> for instance, the F1 score).
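>>>>>>>
>>>>>>> For example (y_test and y_pred standing for your held-out labels and
>>>>>>> the corresponding predictions):
>>>>>>>
>>>>>>> from sklearn.metrics import precision_score, f1_score
>>>>>>>
>>>>>>> print(precision_score(y_test, y_pred))  # fraction of flagged that are true
>>>>>>> print(f1_score(y_test, y_pred))         # harmonic mean of precision/recall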
>>>>>>>
>>>>>>> >      7. I reloaded the model in a different python instance from the
>>>>>>> >         pickle file mentioned above and did my scoring, i.e., used the
>>>>>>> >         joblib library load method and then instantiated prediction
>>>>>>> >         (predict_proba method) on the entire set of my original 600 K
>>>>>>> >         records
>>>>>>> >               Another question – is there an alternate model scoring
>>>>>>> >     library (apart from joblib, the one I am using)?
>>>>>>>
>>>>>>> Joblib is not a scoring library; once you load a model from disk with
>>>>>>> joblib you should get ~ the same RandomForestClassifier estimator
>>>>>>> object
>>>>>>> as before saving it.
>>>>>>>
>>>>>>> >      8. Now when I am running (scoring) my model using
>>>>>>> >         joblib.predict_proba on the entire set of original data
>>>>>>> >         (600 K), I am getting a True Positive rate of around 80%.
>>>>>>>
>>>>>>> That sounds normal, considering what you are doing. Your entire set
>>>>>>> consists of 80% training set (for which the recall, I imagine, would
>>>>>>> be close to 1.0) and 20% test set (with a recall of 0.1), so on
>>>>>>> average you would get a recall close to 0.8 for the complete set
>>>>>>> (0.8 * 1.0 + 0.2 * 0.1 = 0.82). Unless I missed something.
>>>>>>>
>>>>>>>
>>>>>>> >      9. I did some further analysis and figured out that during the
>>>>>>> >         training process, when the model was predicting on the test
>>>>>>> >         sample of 120K it could only predict 10-12% of the 120K data
>>>>>>> >         beyond a probability threshold of 0.5. When I am now trying
>>>>>>> >         to score my model on the entire set of 600 K records, it
>>>>>>> >         appears that the model is remembering some of its past
>>>>>>> >         behavior and data and accordingly throwing an 80% True
>>>>>>> >         positive rate
>>>>>>>
>>>>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>>>>> recall of 0.1 on the test set is quite low. It could be worth trying
>>>>>>> to tune it better (cf. https://stackoverflow.com/a/36109706 ), using
>>>>>>> a metric other than recall to evaluate the performance.
>>>>>>>
>>>>>>>
>>>>>>> Roman
>>>>
>>>>
>>>> --
>>>> Guillaume Lemaitre
>>>> INRIA Saclay - Ile-de-France
>>>> Equipe PARIETAL
>>>> guillaume.lemaitre at inria.fr ---
>>>> https://glemaitre.github.io/
>>>>
>>>
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Ile-de-France
>> Equipe PARIETAL
>> guillaume.lemaitre at inria.fr ---
>> https://glemaitre.github.io/
>>
>

