[scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
Guillaume Lemaître
g.lemaitre58 at gmail.com
Wed Dec 28 14:34:43 EST 2016
After the fit, you need a call like this:

from sklearn.tree import export_graphviz

for idx_tree, tree in enumerate(clf.estimators_):
    export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
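Note that this writes one .dot file per tree (so 5000 files with your current
n_estimators). With Graphviz installed, each file can then be rendered with
something like:

    dot -Tpng 0.dot -o 0.png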
On 28 December 2016 at 20:25, Debabrata Ghosh <mailfordebu at gmail.com> wrote:
> Hi Guillaume,
> With respect to the following point you
> mentioned:
> You can visualize the trees with sklearn.tree.export_graphviz:
> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>
> I couldn't find a direct method for exporting the RandomForestClassifier
> trees. Accordingly, I attempted a workaround using the following code
> but still no success:
>
> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
> clf.fit(p_features_train, p_labels_train)
> for i, tree in enumerate(clf.estimators_):
>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>         tree.export_graphviz(clf, dotfile)
>
> Could you please help me with the piece of code I need to execute for
> exporting the RandomForestClassifier trees?
>
> Cheers,
>
> Debu
>
>
> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
> g.lemaitre58 at gmail.com> wrote:
>
>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Dear Joel, Andrew and Roman,
>>> Thank you very much for your individual feedback! It's very helpful indeed!
>>> A few more points related to my model execution:
>>>
>>> 1. By the term "scoring" I meant the process of executing the model once
>>> again without retraining it. So, for training the model I used the
>>> RandomForestClassifier class, and for my scoring (execution without
>>> retraining) I have used joblib.dump and joblib.load.
>>>
>>
>> You should probably use the standard terms: training, validating, and
>> testing. Scoring is just the value of a metric computed on some data
>> (training data, validation data, or testing data).
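>> A minimal sketch of that workflow (X and y below are placeholders for your
>> own feature matrix and labels):
>>
>> from sklearn.model_selection import train_test_split
>> from sklearn.ensemble import RandomForestClassifier
>>
>> # hold out 20% of the data as a test set
>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
>>                                                     random_state=0)
>> clf = RandomForestClassifier(n_jobs=-1)
>> clf.fit(X_train, y_train)
>> # "scoring": the value of a metric (here, mean accuracy) on the test data
>> print(clf.score(X_test, y_test))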
>>
>>
>>>
>>> 2. I have used the parameter n_estimators = 5000 while training my model.
>>> Besides that, I have used n_jobs = -1 and haven't used any other parameters.
>>>
>>
>> You should probably check those other parameters and understand
>> what their effects are. You should really check the link Roman gave,
>> since GridSearchCV can help you decide how to set the parameters:
>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>> Additionally, 5000 trees seems like a lot to me.
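>>
>> For instance, a rough sketch of such a search (the grid values below are
>> placeholders, not recommendations):
>>
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.model_selection import GridSearchCV
>>
>> param_grid = {'n_estimators': [100, 300, 500],
>>               'max_depth': [5, 10, None],
>>               'min_samples_leaf': [1, 5, 10]}
>> search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, cv=5)
>> search.fit(X_train, y_train)   # cross-validated search over the grid
>> print(search.best_params_)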
>>
>>
>>>
>>> 3. For my "scoring" activity (executing the model without retraining it)
>>> is there an alternate approach to joblib library ?
>>>
>>
>> Joblib only stores data. There is no link with scoring (check Roman's
>> answer).
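>>
>> In other words, something like this (the filename is only illustrative):
>>
>> from sklearn.externals import joblib  # joblib as bundled with 2016-era sklearn
>>
>> joblib.dump(clf, 'rf_model.pkl')           # persist the fitted estimator
>> clf_loaded = joblib.load('rf_model.pkl')   # later, possibly in another process
>> proba = clf_loaded.predict_proba(X_test)   # same estimator as before saving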
>>
>>
>>>
>>> 4. When I execute my scoring job (joblib method) on a dataset which is
>>> completely different from my training dataset, I get a True Positive Rate
>>> and False Positive Rate similar to those from training.
>>>
>>
>> It is what you should get.
>>
>>
>>>
>>> 5. However, when I execute my scoring job on the same dataset used for
>>> training my model, I get very high TPR and FPR.
>>>
>>
>> You are testing on data which you used while training. One of the first
>> rules is not to do that. If you want to evaluate your classifier in some
>> way, keep a separate set (test set) and only test on that one. As
>> previously mentioned by Roman, 80% of your data is already known to the
>> RandomForestClassifier and will be perfectly classified.
>>
>>
>>>
>>> Is there a mechanism through which I can visualise the trees created by
>>> my RandomForestClassifier algorithm? When I dumped the model using
>>> joblib.dump, a bunch of .npy files were created. Do those contain the trees?
>>>
>>
>> You can visualize the trees with sklearn.tree.export_graphviz:
>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>
>> The bunch of .npy files are the data needed to load the
>> RandomForestClassifier which you previously dumped.
>>
>>
>>>
>>> Thanks in advance!
>>>
>>> Cheers,
>>>
>>> Debu
>>>
>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.nothman at gmail.com>
>>> wrote:
>>>
>>>> Your model is overfit to the training data. Not to say that it's
>>>> necessarily possible to get a better fit. The default settings for trees
>>>> lean towards a tight fit, so you might modify their parameters to increase
>>>> regularisation. Still, you should not expect that evaluating a model's
>>>> performance on its training data will be indicative of its general
>>>> performance. This is why we use held-out test sets and cross-validation.
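>>>>
>>>> For example, a minimal cross-validation sketch (X and y stand in for your
>>>> own data):
>>>>
>>>> from sklearn.model_selection import cross_val_score
>>>> from sklearn.ensemble import RandomForestClassifier
>>>>
>>>> clf = RandomForestClassifier(n_jobs=-1)
>>>> scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
>>>> print(scores.mean(), scores.std())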
>>>>
>>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurchak at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Debu,
>>>>>
>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>> > 5. I got a prediction result with True Positive Rate (TPR) as 10-12%
>>>>> > on probability thresholds above 0.5
>>>>>
>>>>> Getting a high True Positive Rate (recall) is not a sufficient
>>>>> condition
>>>>> for a well behaved model. Though 0.1 recall is still pretty bad. You
>>>>> could look at the precision at the same time (or consider, for
>>>>> instance,
>>>>> the F1 score).
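>>>>>
>>>>> For instance, a small sketch (clf, X_test and y_test are placeholders for
>>>>> your fitted model and held-out data):
>>>>>
>>>>> from sklearn.metrics import precision_score, recall_score, f1_score
>>>>>
>>>>> y_pred = clf.predict(X_test)
>>>>> print(recall_score(y_test, y_pred))     # the TPR you are reporting
>>>>> print(precision_score(y_test, y_pred))  # how many flagged positives are real
>>>>> print(f1_score(y_test, y_pred))         # harmonic mean of the two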
>>>>>
>>>>> > 7. I reloaded the model in a different python instance from the
>>>>> > pickle file mentioned above and did my scoring , i.e., used
>>>>> > joblib library load method and then instantiated prediction
>>>>> > (predict_proba method) on the entire set of my original 600 K
>>>>> > records
>>>>> > Another question – is there an alternate model scoring
>>>>> > library (apart from joblib, the one I am using) ?
>>>>>
>>>>> Joblib is not a scoring library; once you load a model from disk with
>>>>> joblib you should get ~ the same RandomForestClassifier estimator
>>>>> object
>>>>> as before saving it.
>>>>>
>>>>> > 8. Now when I am running (scoring) my model using
>>>>> > joblib.predict_proba on the entire set of original data (600K),
>>>>> > I am getting a True Positive rate of around 80%.
>>>>>
>>>>> That sounds normal, considering what you are doing. Your entire set
>>>>> consists of 80% training set (for which the recall, I imagine, would
>>>>> be close to 1.0) and 20% test set (with a recall of 0.1), so on
>>>>> average you would get a recall close to 0.8 for the complete set.
>>>>> Unless I missed something.
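>>>>> (Roughly: 0.8 * 1.0 + 0.2 * 0.1 = 0.82, which matches the ~80% you see.)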
>>>>>
>>>>>
>>>>> > 9. I did some further analysis and figured out that during the
>>>>> > training process, when the model was predicting on the test
>>>>> > sample of 120K it could only predict 10-12% of the 120K data
>>>>> > beyond a probability threshold of 0.5. When I am now trying to
>>>>> > score my model on the entire set of 600K records, it appears that
>>>>> > the model is remembering some of its past behavior and data and
>>>>> > accordingly throwing an 80% True Positive rate
>>>>>
>>>>> It feels like your RandomForestClassifier is not properly tuned. A
>>>>> recall of 0.1 on the test set is quite low. It could be worth trying to
>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>>>>> other metric than the recall to evaluate the performance.
>>>>>
>>>>>
>>>>> Roman
>>>>
>>>>
>>>
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Ile-de-France
>> Equipe PARIETAL
>> guillaume.lemaitre at inria.fr ---
>> https://glemaitre.github.io/
>>
>
--
Guillaume Lemaitre
INRIA Saclay - Ile-de-France
Equipe PARIETAL
guillaume.lemaitre at inria.fr ---
https://glemaitre.github.io/