[scikit-learn] Query Regarding Model Scoring using scikit-learn's joblib library

Debabrata Ghosh mailfordebu at gmail.com
Thu Dec 29 00:00:56 EST 2016


Thanks, Naoya! This worked, and I am now able to generate the .dot files.

Cheers,

Debu

On Thu, Dec 29, 2016 at 10:20 AM, Naoya Kanai <naopon at gmail.com> wrote:

> The name ‘tree’ is clashing: inside the loop, the DecisionTreeClassifier
> loop variable shadows the sklearn.tree module, so tree.export_graphviz
> resolves to the estimator, which has no such attribute.
>
> You can change the import to
>
> from sklearn.tree import export_graphviz
>
> and modify the method call accordingly.
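>
> Concretely, a minimal sketch of the corrected loop (clf being the fitted
> classifier from your session; I also rename the loop variable so it no
> longer shadows anything):
>
> from sklearn.tree import export_graphviz
>
> for idx_tree, estimator in enumerate(clf.estimators_):
>     # each estimator is a fitted DecisionTreeClassifier
>     export_graphviz(estimator, out_file='{}.dot'.format(idx_tree))
>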
> On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh <mailfordebu at gmail.com>
> wrote:
>
>> Hi Guillaume,
>> Thanks for your feedback! I am still getting an error while attempting
>> to export the trees. Here is a snapshot of my code. I know I may be
>> missing something very silly, but I still wanted to check how this works.
>>
>> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>> >>> clf.fit(p_features_train,p_labels_train)
>> RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
>>             max_depth=None, max_features='auto', max_leaf_nodes=None,
>>             min_samples_leaf=1, min_samples_split=2,
>>             min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
>>             oob_score=False, random_state=None, verbose=0,
>>             warm_start=False)
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ...     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> NameError: name 'export_graphviz' is not defined
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ...     tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> AttributeError: 'DecisionTreeClassifier' object has no attribute
>> 'export_graphviz'
>>
>> Just to give you some background, I have imported the following
>> libraries:
>>
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn import tree
>>
>> Thanks again, as always!
>>
>> Cheers,
>>
>> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <
>> g.lemaitre58 at gmail.com> wrote:
>>
>>> After the fit, you need a loop like this:
>>> for idx_tree, tree in enumerate(clf.estimators_):
>>>     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>>
>>>
>>>
>>> On 28 December 2016 at 20:25, Debabrata Ghosh <mailfordebu at gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume,
>>>> With respect to the following point you mentioned:
>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>
>>>> I couldn't find a direct method for exporting the
>>>> RandomForestClassifier trees. Accordingly, I attempted a workaround
>>>> using the following code, but still no success:
>>>>
>>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>>> clf.fit(p_features_train,p_labels_train)
>>>> for i, tree in enumerate(clf.estimators_):
>>>>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>>>          tree.export_graphviz(clf, dotfile)
>>>>
>>>> Would you please help me with the code I need to run to export the
>>>> RandomForestClassifier trees?
>>>>
>>>> Cheers,
>>>>
>>>> Debu
>>>>
>>>>
>>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
>>>> g.lemaitre58 at gmail.com> wrote:
>>>>
>>>>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear Joel, Andrew and Roman,
>>>>>> Thank you very much for your individual feedback! It's very helpful
>>>>>> indeed! A few more points related to my model execution:
>>>>>>
>>>>>> 1. By the term "scoring" I meant the process of executing the model
>>>>>> once again without retraining it. So, for training the model I used
>>>>>> the RandomForestClassifier class, and for my scoring (execution
>>>>>> without retraining) I used joblib.dump and joblib.load.
>>>>>>
>>>>>
>>>>> Probably go with the standard terms: training, validating, and
>>>>> testing. Scoring is just the value of a metric given some data
>>>>> (training, validation, or testing data).
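>>>>>
>>>>> For instance, a minimal sketch (X and y stand for your features and
>>>>> labels; the names are placeholders):
>>>>>
>>>>> from sklearn.ensemble import RandomForestClassifier
>>>>> from sklearn.model_selection import train_test_split
>>>>>
>>>>> # keep a held-out test set the model never sees during training
>>>>> X_train, X_test, y_train, y_test = train_test_split(
>>>>>     X, y, test_size=0.2, random_state=42)
>>>>>
>>>>> clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
>>>>> clf.fit(X_train, y_train)
>>>>>
>>>>> # "scoring" is just a metric value on some data:
>>>>> print(clf.score(X_test, y_test))   # mean accuracy on the test set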
>>>>>
>>>>>
>>>>>>
>>>>>> 2. I used the parameter n_estimators=5000 while training my model.
>>>>>> Besides that, I used n_jobs=-1 and no other parameters.
>>>>>>
>>>>>
>>>>> You should probably check those other parameters and understand
>>>>> their effects. You should really check the link Roman sent, since
>>>>> GridSearchCV can help you decide how to set these parameters:
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>>> Additionally, 5000 trees seems like a lot to me.
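>>>>>
>>>>> A minimal GridSearchCV sketch (the grid values below are only
>>>>> illustrative, not recommendations):
>>>>>
>>>>> from sklearn.ensemble import RandomForestClassifier
>>>>> from sklearn.model_selection import GridSearchCV
>>>>>
>>>>> param_grid = {
>>>>>     'n_estimators': [100, 500, 1000],
>>>>>     'max_depth': [None, 10, 20],
>>>>>     'min_samples_leaf': [1, 5, 10],
>>>>> }
>>>>> # 5-fold cross-validated search over the grid
>>>>> search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, cv=5)
>>>>> search.fit(X_train, y_train)   # X_train, y_train as above
>>>>> print(search.best_params_)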
>>>>>
>>>>>
>>>>>>
>>>>>> 3. For my "scoring" activity (executing the model without
>>>>>> retraining it), is there an alternative to the joblib library?
>>>>>>
>>>>>
>>>>> Joblib only stores data; it has no link with scoring (check Roman's
>>>>> answer).
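>>>>>
>>>>> For example (a sketch; 'model.pkl' and X_new are placeholders):
>>>>>
>>>>> from sklearn.externals import joblib
>>>>>
>>>>> joblib.dump(clf, 'model.pkl')         # persist the fitted estimator
>>>>> clf_loaded = joblib.load('model.pkl')
>>>>> # clf_loaded is the same RandomForestClassifier as before saving;
>>>>> # prediction still goes through the estimator, not through joblib
>>>>> proba = clf_loaded.predict_proba(X_new)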
>>>>>
>>>>>
>>>>>>
>>>>>> 4. When I execute my scoring job (via joblib) on a dataset that is
>>>>>> completely different from my training dataset, I get a True Positive
>>>>>> Rate and False Positive Rate similar to those from training.
>>>>>>
>>>>>
>>>>> That is what you should get.
>>>>>
>>>>>
>>>>>>
>>>>>> 5. However, when I execute my scoring job on the same dataset used
>>>>>> for training my model, I get a very high TPR and FPR.
>>>>>>
>>>>>
>>>>> You are testing on data which you used during training. One of the
>>>>> first rules is not to do that. If you want to evaluate your
>>>>> classifier, keep a separate set (a test set) and only test on that
>>>>> one. As previously mentioned by Roman, 80% of your data is already
>>>>> known by the RandomForestClassifier and will be classified nearly
>>>>> perfectly.
>>>>>
>>>>>
>>>>>>
>>>>>> Is there a mechanism through which I can visualise the trees created
>>>>>> by my RandomForestClassifier? When I dumped the model using
>>>>>> joblib.dump, a bunch of .npy files were created. Do those contain
>>>>>> the trees?
>>>>>>
>>>>>
>>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>>
>>>>> The bunch of .npy files hold the data needed to reload the
>>>>> RandomForestClassifier which you previously dumped.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Debu
>>>>>>
>>>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Your model is overfit to the training data. Not to say that it's
>>>>>>> necessarily possible to get a better fit. The default settings for trees
>>>>>>> lean towards a tight fit, so you might modify their parameters to increase
>>>>>>> regularisation. Still, you should not expect that evaluating a model's
>>>>>>> performance on its training data will be indicative of its general
>>>>>>> performance. This is why we use held-out test sets and cross-validation.
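>>>>>>>
>>>>>>> For instance, a minimal cross-validation sketch (X and y are
>>>>>>> placeholders for your features and labels):
>>>>>>>
>>>>>>> from sklearn.model_selection import cross_val_score
>>>>>>>
>>>>>>> # each of the 5 folds is evaluated on data the model was not trained on
>>>>>>> scores = cross_val_score(clf, X, y, cv=5)
>>>>>>> print(scores.mean(), scores.std())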
>>>>>>>
>>>>>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Debu,
>>>>>>>>
>>>>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>>>>> >      5. I got a prediction result with True Positive Rate (TPR)
>>>>>>>> >         as 10-12% on probability thresholds above 0.5
>>>>>>>>
>>>>>>>> Getting a high True Positive Rate (recall) is not a sufficient
>>>>>>>> condition for a well-behaved model, though a recall of 0.1 is
>>>>>>>> still pretty bad. You could look at the precision at the same time
>>>>>>>> (or consider, for instance, the F1 score).
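>>>>>>>>
>>>>>>>> For example (a sketch; y_test and y_pred stand for the true and
>>>>>>>> predicted labels on your held-out set):
>>>>>>>>
>>>>>>>> from sklearn.metrics import precision_score, recall_score, f1_score
>>>>>>>>
>>>>>>>> print(precision_score(y_test, y_pred))
>>>>>>>> print(recall_score(y_test, y_pred))   # recall == True Positive Rate
>>>>>>>> print(f1_score(y_test, y_pred))       # harmonic mean of the two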
>>>>>>>>
>>>>>>>> >      7. I reloaded the model in a different Python instance from
>>>>>>>> >         the pickle file mentioned above and did my scoring, i.e.,
>>>>>>>> >         used the joblib load method and then ran prediction (the
>>>>>>>> >         predict_proba method) on the entire set of my original
>>>>>>>> >         600 K records.
>>>>>>>> >         Another question – is there an alternate model scoring
>>>>>>>> >         library (apart from joblib, the one I am using)?
>>>>>>>>
>>>>>>>> Joblib is not a scoring library; once you load a model from disk
>>>>>>>> with
>>>>>>>> joblib you should get ~ the same RandomForestClassifier estimator
>>>>>>>> object
>>>>>>>> as before saving it.
>>>>>>>>
>>>>>>>> >      8. Now when I am running (scoring) my model using
>>>>>>>> >         joblib.predict_proba on the entire set of original data
>>>>>>>> >         (600 K), I am getting a True Positive Rate of around 80%.
>>>>>>>>
>>>>>>>> That sounds normal, considering what you are doing. Your entire
>>>>>>>> set consists of an 80% training set (for which the recall, I
>>>>>>>> imagine, would be close to 1.0) and a 20% test set (with a recall
>>>>>>>> of 0.1), so on average you would get a recall close to 0.8 for the
>>>>>>>> complete set. Unless I missed something.
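>>>>>>>>
>>>>>>>> (Roughly: 0.8 * 1.0 + 0.2 * 0.1 = 0.82, which matches the ~80% you
>>>>>>>> observe.)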
>>>>>>>>
>>>>>>>>
>>>>>>>> >      9. I did some further analysis and figured out that during
>>>>>>>> >         the training process, when the model was predicting on
>>>>>>>> >         the test sample of 120 K it could only predict 10-12% of
>>>>>>>> >         the 120 K records beyond a probability threshold of 0.5.
>>>>>>>> >         When I am now trying to score my model on the entire set
>>>>>>>> >         of 600 K records, it appears that the model is
>>>>>>>> >         remembering some of its past behavior and data and
>>>>>>>> >         accordingly throwing an 80% True Positive Rate
>>>>>>>>
>>>>>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>>>>>> recall of 0.1 on the test set is quite low. It could be worth
>>>>>>>> trying to tune it better (cf. https://stackoverflow.com/a/36109706),
>>>>>>>> using a metric other than recall to evaluate the performance.
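>>>>>>>>
>>>>>>>> For instance, a sketch (param_grid is a hypothetical grid of
>>>>>>>> parameters to try, as above):
>>>>>>>>
>>>>>>>> from sklearn.model_selection import GridSearchCV
>>>>>>>>
>>>>>>>> # optimise the F1 score rather than recall alone
>>>>>>>> search = GridSearchCV(clf, param_grid, scoring='f1', cv=5)
>>>>>>>> search.fit(X_train, y_train)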
>>>>>>>>
>>>>>>>>
>>>>>>>> Roman
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Lemaitre
>>>>> INRIA Saclay - Ile-de-France
>>>>> Equipe PARIETAL
>>>>> guillaume.lemaitre at inria.fr --- https://glemaitre.github.io/
>>>
>>>
>>> --
>>> Guillaume Lemaitre
>>> INRIA Saclay - Ile-de-France
>>> Equipe PARIETAL
>>> guillaume.lemaitre at inria.fr --- https://glemaitre.github.io/