[scikit-learn] Query Regarding Model Scoring using scikit-learn's joblib library
Debabrata Ghosh
mailfordebu at gmail.com
Wed Dec 28 23:38:21 EST 2016
Hi Guillaume,
Thanks for your feedback! I am still getting an error while attempting to
print the trees. Here is a snapshot of my code. I know I may be missing
something very silly, but I still wanted to check and see how this works.
>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>> clf.fit(p_features_train,p_labels_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> for idx_tree, tree in enumerate(clf.estimators_):
...     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
NameError: name 'export_graphviz' is not defined
>>> for idx_tree, tree in enumerate(clf.estimators_):
...     tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
AttributeError: 'DecisionTreeClassifier' object has no attribute
'export_graphviz'
Just to give you some background, these are the imports in my session:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
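Looking at it again, I suspect that export_graphviz was never imported into
my namespace, and that my loop variable "tree" is shadowing the "tree"
module, which would explain the AttributeError. A minimal sketch of what I
will try next (untested on my side):

from sklearn.tree import export_graphviz

# each element of clf.estimators_ is a fitted DecisionTreeClassifier
for idx_tree, one_tree in enumerate(clf.estimators_):
    export_graphviz(one_tree, out_file='{}.dot'.format(idx_tree))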
Thanks again, as always!
Cheers,
On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <g.lemaitre58 at gmail.com>
wrote:
> after the fit, you need this call:
>
> for idx_tree, tree in enumerate(clf.estimators_):
>     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>
>
>
> On 28 December 2016 at 20:25, Debabrata Ghosh <mailfordebu at gmail.com>
> wrote:
>
>> Hi Guillaume,
>> With respect to the following point you
>> mentioned:
>> You can visualize the trees with sklearn.tree.export_graphviz:
>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>
>> I couldn't find a direct method for exporting the RandomForestClassifier
>> trees. Accordingly, I attempted a workaround using the following code,
>> but still with no success:
>>
>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>> clf.fit(p_features_train,p_labels_train)
>> for i, tree in enumerate(clf.estimators_):
>>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>         tree.export_graphviz(clf, dotfile)
>>
>> Could you please help me with the piece of code I need to execute to
>> export the RandomForestClassifier trees?
>>
>> Cheers,
>>
>> Debu
>>
>>
>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
>> g.lemaitre58 at gmail.com> wrote:
>>
>>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
>>> wrote:
>>>
>>>> Dear Joel, Andrew and Roman,
>>>> Thank you very much for your individual feedback! It's very helpful
>>>> indeed! A few more points related to my model execution:
>>>>
>>>> 1. By the term "scoring" I meant the process of executing the model
>>>> once again without retraining it. So, for training the model I used
>>>> RandomForestClassifier, and for my scoring (execution without
>>>> retraining) I used joblib.dump and joblib.load.
>>>>
>>>
>>> You should probably go with the terms: training, validating, and
>>> testing. This is pretty much standard. Scoring is just the value of a
>>> metric given some data (training data, validation data, or testing
>>> data).
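>>>
>>> For instance, a minimal sketch of such a split (the variable names are
>>> only illustrative):
>>>
>>> from sklearn.model_selection import train_test_split
>>>
>>> # hold out 20% of the data; the model never sees it during training
>>> X_train, X_test, y_train, y_test = train_test_split(
>>>     features, labels, test_size=0.2, random_state=42)
>>> clf.fit(X_train, y_train)
>>> # the "score" is just a metric (here, mean accuracy) on given data
>>> print(clf.score(X_test, y_test))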
>>>
>>>
>>>>
>>>> 2. I used the parameter n_estimators=5000 while training my model.
>>>> Besides that, I used n_jobs=-1 and haven't set any other parameters.
>>>>
>>>
>>> You should probably check those other parameters and understand what
>>> their effects are. You should really check the link Roman posted, since
>>> GridSearchCV can help you decide how to set the parameters:
>>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>> Additionally, 5000 trees seems like a lot to me.
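>>>
>>> For example, something along these lines (the grid values are only
>>> illustrative, not recommendations):
>>>
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> param_grid = {'n_estimators': [100, 500, 1000],
>>>               'max_depth': [5, 10, None]}
>>> search = GridSearchCV(RandomForestClassifier(n_jobs=-1),
>>>                       param_grid, scoring='f1', cv=5)
>>> search.fit(X_train, y_train)
>>> print(search.best_params_)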
>>>
>>>
>>>>
>>>> 3. For my "scoring" activity (executing the model without retraining
>>>> it), is there an alternative approach to the joblib library?
>>>>
>>>
>>> Joblib only stores data. It has no link with scoring (check Roman's
>>> answer).
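>>>
>>> That is, something like this (the path and names are only illustrative):
>>>
>>> from sklearn.externals import joblib
>>>
>>> joblib.dump(clf, 'model.pkl')    # serialize the fitted model to disk
>>> clf2 = joblib.load('model.pkl')  # restore it later, without retraining
>>> # clf2 behaves like the original estimator:
>>> probas = clf2.predict_proba(X_test)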
>>>
>>>
>>>>
>>>> 4. When I execute my scoring job (the joblib method) on a dataset
>>>> which is completely different from my training dataset, I get a True
>>>> Positive Rate and False Positive Rate similar to those from training.
>>>>
>>>
>>> That is what you should get.
>>>
>>>
>>>>
>>>> 5. However, when I execute my scoring job on the same dataset used
>>>> for training my model, I get a very high TPR and FPR.
>>>>
>>>
>>> You are testing on data which you used during training. One of the
>>> first rules is not to do that. If you want to evaluate your classifier
>>> in some way, keep a separate set (a test set) and only test on that
>>> one. As Roman previously mentioned, 80% of your data is already known
>>> to the RandomForestClassifier and will be perfectly classified.
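>>>
>>> For instance, a sketch of computing the TPR and FPR on the held-out
>>> set only (assuming binary labels):
>>>
>>> from sklearn.metrics import confusion_matrix
>>>
>>> y_pred = clf.predict(X_test)
>>> tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
>>> tpr = tp / float(tp + fn)  # true positive rate (recall)
>>> fpr = fp / float(fp + tn)  # false positive rate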
>>>
>>>
>>>>
>>>> Is there a mechanism through which I can visualise the trees created
>>>> by my RandomForestClassifier algorithm? When I dumped the model using
>>>> joblib.dump, a bunch of .npy files were created. Will those contain
>>>> the trees?
>>>>
>>>
>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>
>>> The bunch of .npy files are the data needed to reload the
>>> RandomForestClassifier which you previously dumped.
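>>>
>>> If Graphviz is installed, each exported .dot file can then be rendered
>>> from the shell, e.g. with "dot -Tpng 0.dot -o 0.png".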
>>>
>>>
>>>>
>>>> Thanks in advance!
>>>>
>>>> Cheers,
>>>>
>>>> Debu
>>>>
>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
>>>> wrote:
>>>>
>>>>> Your model is overfit to the training data. Not to say that it's
>>>>> necessarily possible to get a better fit. The default settings for trees
>>>>> lean towards a tight fit, so you might modify their parameters to increase
>>>>> regularisation. Still, you should not expect that evaluating a model's
>>>>> performance on its training data will be indicative of its general
>>>>> performance. This is why we use held-out test sets and cross-validation.
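>>>>>
>>>>> For example, parameters along these lines increase regularisation
>>>>> (the values are illustrative, not recommendations):
>>>>>
>>>>> clf = RandomForestClassifier(n_estimators=500,
>>>>>                              max_depth=10,        # cap tree depth
>>>>>                              min_samples_leaf=5,  # forbid tiny leaves
>>>>>                              n_jobs=-1)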
>>>>>
>>>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Debu,
>>>>>>
>>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>>> > 5. I got a prediction result with a True Positive Rate (TPR) of
>>>>>> > 10-12% at probability thresholds above 0.5
>>>>>>
>>>>>> Getting a high True Positive Rate (recall) is not a sufficient
>>>>>> condition for a well-behaved model, though 0.1 recall is still
>>>>>> pretty bad. You could look at the precision at the same time (or
>>>>>> consider, for instance, the F1 score).
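>>>>>>
>>>>>> For example (a sketch, with y_test and y_pred coming from a
>>>>>> held-out evaluation):
>>>>>>
>>>>>> from sklearn.metrics import precision_score, recall_score, f1_score
>>>>>>
>>>>>> print(precision_score(y_test, y_pred))
>>>>>> print(recall_score(y_test, y_pred))  # recall == TPR
>>>>>> print(f1_score(y_test, y_pred))      # harmonic mean of the two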
>>>>>>
>>>>>> > 7. I reloaded the model in a different python instance from the
>>>>>> > pickle file mentioned above and did my scoring, i.e., used the
>>>>>> > joblib library's load method and then ran prediction
>>>>>> > (predict_proba) on the entire set of my original 600K records.
>>>>>> > Another question: is there an alternate model scoring library
>>>>>> > (apart from joblib, the one I am using)?
>>>>>>
>>>>>> Joblib is not a scoring library; once you load a model from disk with
>>>>>> joblib you should get ~ the same RandomForestClassifier estimator
>>>>>> object
>>>>>> as before saving it.
>>>>>>
>>>>>> > 8. Now when I am running (scoring) my model using
>>>>>> > joblib.predict_proba on the entire set of original data (600K),
>>>>>> > I am getting a True Positive Rate of around 80%.
>>>>>>
>>>>>> That sounds normal, considering what you are doing. Your entire set
>>>>>> consists of 80% training set (for which the recall, I imagine, would
>>>>>> be close to 1.0) and 20% test set (with a recall of 0.1), so on
>>>>>> average you would get a recall close to 0.8 for the complete set.
>>>>>> Unless I missed something.
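>>>>>> (Roughly: 0.8 * 1.0 + 0.2 * 0.1 = 0.82, i.e. close to 0.8.)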
>>>>>>
>>>>>>
>>>>>> > 9. I did some further analysis and figured out that during the
>>>>>> > training process, when the model was predicting on the test
>>>>>> > sample of 120K, it could only predict 10-12% of the 120K data
>>>>>> > beyond a probability threshold of 0.5. When I am now trying to
>>>>>> > score my model on the entire set of 600K records, it appears
>>>>>> > that the model is remembering some of its past behavior and
>>>>>> > data and accordingly throwing an 80% True Positive Rate.
>>>>>>
>>>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>>>> recall of 0.1 on the test set is quite low. It could be worth trying
>>>>>> to tune it better (cf. https://stackoverflow.com/a/36109706 ), using
>>>>>> some metric other than recall to evaluate the performance.
>>>>>>
>>>>>>
>>>>>> Roman
>>>
>>>
>>> --
>>> Guillaume Lemaitre
>>> INRIA Saclay - Ile-de-France
>>> Equipe PARIETAL
>>> guillaume.lemaitre at inria.fr ---
>>> https://glemaitre.github.io/
>>>
>>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Ile-de-France
> Equipe PARIETAL
> guillaume.lemaitre at inria.fr ---
> https://glemaitre.github.io/
>