Thanks for sharing your comments about this, Piotr.
Hi Doug,

I modified your code a little bit to calculate the feature importances of every tree of the forest. In my opinion, these per-tree feature importances should also sum to 1.0. Since I could not access each DecisionTreeRegressor of your GradientBoostingRegressor, I created a new ExtraTreesRegressor instead.
This is a bit off topic, but does anyone have an idea why
type(ExtraTreesRegressor().estimators_)
results in a list, while
type(GradientBoostingRegressor().estimators_)
results in an np.ndarray?
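As far as I can tell, GradientBoostingRegressor keeps its stages in a 2-D numpy array of shape (n_estimators, K), with one DecisionTreeRegressor per stage (and per class for classification losses), whereas the forest ensembles keep a plain Python list. So the individual boosting trees should be reachable by indexing that array. A quick sketch of what I mean, using the same Boston setup as below but with fewer estimators:

=== START CODE ===
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

clf = GradientBoostingRegressor(n_estimators=10, max_depth=6, learning_rate=0.1)
clf.fit(X, Y)

# estimators_ is an ndarray; for regression its shape is (n_estimators, 1)
print(clf.estimators_.shape)

# each boosting stage's DecisionTreeRegressor can be reached by indexing
for idx, tree in enumerate(clf.estimators_.ravel()):
    print("importance sum for stage %i: %.20f" % (idx, np.sum(tree.feature_importances_)))
=== END CODE ===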
Anyway, here is the code:
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor, ExtraTreesRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 712
# Note: From 712 onwards, the feature importance sum is less than 1
params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)
feature_importance_sum = np.sum(clf.feature_importances_)
print "At n_estimators = %i, feature importance sum = %.20f" % (n_estimators, feature_importance_sum)

n_estimators_forest = 100
clf_forest = ExtraTreesRegressor(n_estimators=n_estimators_forest)
clf_forest.fit(X, Y)
feature_importance_sum_forest = np.sum(clf_forest.feature_importances_)
forest_feat_imp = [np.sum(tree.feature_importances_) for tree in clf_forest.estimators_]
print "At n_estimators = %i, feature importance sum = %.20f" % (n_estimators_forest, feature_importance_sum_forest)

for idx, imp in enumerate(forest_feat_imp):
    print "imp for tree %i: %.20f" % (idx, imp)
I suppose each tree carries a small rounding error, and these errors add up to the overall discrepancy. So is this a bug or an unavoidable rounding issue?
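If it helps, here is a toy illustration of that rounding hypothesis, completely outside of scikit-learn: average a set of vectors that each sum to (nearly) 1 and check how far the total drifts. The drift should stay on the order of machine epsilon (around 1e-16), which is much smaller than the deficit Doug reported, so pure float64 accumulation may not be the whole story.

=== START CODE ===
import numpy as np

# Toy illustration, not the scikit-learn internals: build per-tree importance
# vectors that each sum to (nearly) 1, average them, and check the total.
rng = np.random.RandomState(0)
n_trees, n_features = 712, 13
per_tree = rng.rand(n_trees, n_features)
per_tree /= per_tree.sum(axis=1, keepdims=True)   # normalize each row to ~1.0

avg = per_tree.mean(axis=0)                        # averaged "importances"
print("sum of averaged importances = %.20f" % np.sum(avg))
print("worst per-row deviation from 1 = %.3e" % np.max(np.abs(per_tree.sum(axis=1) - 1.0)))
=== END CODE ===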
Greets,
Piotr
On 09.09.2016 03:51, Douglas Chan wrote:
Hello everyone,
I’d like to bring this up again to see if people have any thoughts on it. If you also think this is a bug, then we can track it and get it fixed. Please share your opinions.
Thank you,
-Doug
Sent: Wednesday, August 31, 2016 4:52 PM
Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
Thanks for your reply, Raphael.
Here’s some code using the Boston dataset to reproduce this.
=== START CODE ===
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
boston = datasets.load_boston()
X, Y = (boston.data, boston.target)
n_estimators = 712
# Note: From 712 onwards, the feature importance sum is less than 1
params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)
feature_importance_sum = np.sum(clf.feature_importances_)
print "At n_estimators = %i, feature importance sum = %f" % (n_estimators
, feature_importance_sum)
=== END CODE ===
If we deem this to be an error, I can open a bug to track it.
Please share your thoughts on it.
Thank you,
-Doug
Sent: Tuesday, August 30, 2016 11:28 PM
Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
Can you provide a reproducible example?
Raphael
On Wednesday, August 31, 2016, Douglas Chan <douglas.chan@ieee.org> wrote:
Hello everyone,
I’ve noticed conditions in which the feature importance values do not add up to 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees, and I wonder if there’s a bug in the code.

This error occurs when the ensemble has a large number of estimators. The exact threshold depends on several factors: for example, the error shows up sooner with a smaller number of training samples, or when the trees are deeper.

When this error appears, the predicted value seems to have converged, but it’s unclear whether the error is what keeps the predicted value from changing with more estimators. In fact, the feature importance sum goes lower and lower as more estimators are added thereafter. I wonder if we’re hitting some floating point calculation error.
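One thing that might be worth checking, though this is only a guess on my part: whether some of the later boosting stages are degenerate trees that never split (a single root node). If such a tree reports an all-zero importance vector while the ensemble still averages over the full number of stages, the overall sum would fall below 1 by a visible amount rather than by mere rounding. A diagnostic sketch along these lines could confirm or rule that out:

=== START CODE ===
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 712
clf = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=6, learning_rate=0.1)
clf.fit(X, Y)

trees = clf.estimators_.ravel()
# A tree that never split has a single node (just the root leaf).
single_node = sum(1 for t in trees if t.tree_.node_count == 1)
# Such trees should also report an importance vector summing to 0.
zero_importance = sum(1 for t in trees if np.sum(t.feature_importances_) == 0.0)

print("trees with a single node: %i of %i" % (single_node, n_estimators))
print("trees with zero importance sum: %i of %i" % (zero_importance, n_estimators))
print("overall feature importance sum: %.20f" % np.sum(clf.feature_importances_))
=== END CODE ===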
Looking forward to hearing your thoughts on this.
Thank you!
-Doug
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn