[scikit-learn] impurity criterion in gradient boosted regression trees
Olga Lyashevska
o.lyashevskaya at gmail.com
Tue May 9 11:34:07 EDT 2017
Hi all,
I am trying to understand differences in the feature importance plots
obtained with the R package gbm and with sklearn. Having compared both
implementations side by side, the models seem fairly similar; the
feature importance plots, however, are rather distinct.
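For context, the sklearn side of my comparison looks roughly like the
snippet below (illustrative only: make_friedman1 stands in for my actual
data, and the parameters are not the ones I really used):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
    model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                      max_depth=3, random_state=0).fit(X, y)

    # feature_importances_ is what I am plotting on the sklearn side
    plt.bar(range(X.shape[1]), model.feature_importances_)
    plt.xlabel("feature")
    plt.ylabel("relative importance (sklearn)")
    plt.show()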
R uses the empirical improvement in squared error, as described in
Friedman's "Greedy Function Approximation" paper (eq. 44, 45).
sklearn (as far as I can see in the code) uses the weighted reduction
in node impurity. How exactly is this calculated? Is it the Gini
index? Is there a reference?
I found this, but it is hard to follow:
https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056
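To check my reading of that code, here is a rough sketch of what I
think it is doing, reconstructed from the fitted trees' public
attributes rather than from the Cython itself (again make_friedman1
just stands in for my data, and the normalisation may not match
feature_importances_ exactly):

    import numpy as np
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_friedman1(n_samples=500, random_state=0)
    model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

    # Credit every split with the weighted decrease in (MSE) impurity it
    # produces, accumulated per feature over all trees -- my reading of _tree.pyx.
    gains = np.zeros(X.shape[1])
    for stage in model.estimators_:
        tree = stage[0].tree_                   # one regression tree per stage
        for node in range(tree.node_count):
            left, right = tree.children_left[node], tree.children_right[node]
            if left == -1:                      # leaf: no split, nothing to credit
                continue
            gains[tree.feature[node]] += (
                tree.weighted_n_node_samples[node] * tree.impurity[node]
                - tree.weighted_n_node_samples[left] * tree.impurity[left]
                - tree.weighted_n_node_samples[right] * tree.impurity[right])

    gains /= gains.sum()                        # normalise to sum to one
    print(gains)
    print(model.feature_importances_)           # roughly comparable, if I read the code right

Is that the right picture of what the impurity-based importance is doing?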
I have also seen a post by Matthew Drury on Stack Exchange:
https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
Many thanks,
Olga