[scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
Andreas Mueller
t3kcit at gmail.com
Wed May 16 13:27:40 EDT 2018
I don't think that's how most people use the trees, though.
Probably not even the ExtraTrees.
I really need to get around to reading your thesis :-/
Do you recommend using max_features=1 with ExtraTrees?
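For concreteness, a minimal sketch of what that setup would look like
(totally randomized trees via ExtraTrees with max_features=1, depth
limited through min_samples_split); the dataset and parameter values
are only illustrative, not a recommendation:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import ExtraTreesClassifier

    X, y = load_iris(return_X_y=True)

    # Totally randomized trees: each split considers a single random
    # feature (max_features=1); min_samples_split keeps the trees from
    # splitting all the way down on noise. Values are illustrative.
    model = ExtraTreesClassifier(
        n_estimators=500,
        max_features=1,
        min_samples_split=10,
        random_state=0,
    ).fit(X, y)

    # MDI (mean decrease in impurity) importances, normalized to sum to 1.
    print(model.feature_importances_)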
On 05/05/2018 05:21 AM, Gilles Louppe wrote:
> Hi,
>
> See also chapters 6 and 7 of http://arxiv.org/abs/1407.7502 for another
> point of view regarding the "issue" with feature importances. TL;DR: Feature
> importances as we have them in scikit-learn (i.e. MDI) are provably **not**
> biased, provided the trees are built totally at random (as in ExtraTrees with
> max_features=1) and the depth is controlled via min_samples_split (to avoid
> splitting on noise). On the other hand, it is not always clear what you
> actually compute with MDA (permutation-based importances), since it is
> conditioned on the model you use.
>
> Gilles
> On Sat, 5 May 2018 at 10:36, Guillaume Lemaître <g.lemaitre58 at gmail.com>
> wrote:
>
>> +1 on the post pointed out by Jeremiah.
>> On 5 May 2018 at 02:08, Johnson, Jeremiah <Jeremiah.Johnson at unh.edu> wrote:
>
>>> Faraz, take a look at the discussion of this issue here:
>>> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
>>> Best,
>>> Jeremiah
>>> =========================================
>>> Jeremiah W. Johnson, Ph.D.
>>> Asst. Professor of Data Science
>>> Program Coordinator, B.S. in Analytics & Data Science
>>> University of New Hampshire
>>> Manchester, NH 03101
>>> https://www.linkedin.com/in/jwjohnson314
>>> From: scikit-learn <scikit-learn-bounces+jeremiah.johnson=unh.edu at python.org> on behalf of "Niyaghi, Faraz" <niyaghif at oregonstate.edu>
>>> Reply-To: Scikit-learn mailing list <scikit-learn at python.org>
>>> Date: Friday, May 4, 2018 at 7:10 PM
>>> To: "scikit-learn at python.org" <scikit-learn at python.org>
>>> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
>
>>> Greetings,
>>> This is Faraz Niyaghi from Oregon State University. I do research on
>>> variable selection using random forests. To the best of my knowledge, there
>>> is a difference between scikit-learn's and Breiman's definitions of feature
>>> importance. Breiman uses out-of-bag (OOB) cases to calculate feature
>>> importance, but scikit-learn doesn't. I was wondering: 1) why are they
>>> different? 2) can they result in very different rankings of features?
>
>>> Here are the definitions I found on the web:
>>> Breiman: "In every tree grown in the forest, put down the oob cases and
>>> count the number of votes cast for the correct class. Now randomly permute
>>> the values of variable m in the oob cases and put these cases down the
>>> tree. Subtract the number of votes for the correct class in the
>>> variable-m-permuted oob data from the number of votes for the correct class
>>> in the untouched oob data. The average of this number over all trees in the
>>> forest is the raw importance score for variable m."
>>> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
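Breiman's procedure uses each tree's own OOB cases; scikit-learn's forests
do not expose per-tree OOB indices publicly, so here is a rough hand-rolled
sketch of the same permutation idea using a single held-out set instead of
the OOB cases, and scoring accuracy rather than counting votes. All names
and data are illustrative, not Breiman's exact procedure:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)

    rng = np.random.RandomState(0)
    baseline = forest.score(X_test, y_test)  # accuracy on untouched data
    importances = np.empty(X.shape[1])
    for m in range(X.shape[1]):
        X_perm = X_test.copy()
        rng.shuffle(X_perm[:, m])  # break the link between feature m and y
        # Importance of feature m = drop in accuracy after permuting it.
        importances[m] = baseline - forest.score(X_perm, y_test)

    print(importances)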
>>> scikit-learn: "The relative rank (i.e. depth) of a feature used as a
>>> decision node in a tree can be used to assess the relative importance of
>>> that feature with respect to the predictability of the target variable.
>>> Features used at the top of the tree contribute to the final prediction
>>> decision of a larger fraction of the input samples. The expected fraction
>>> of the samples they contribute to can thus be used as an estimate of the
>>> relative importance of the features."
>>> Link: http://scikit-learn.org/stable/modules/ensemble.html
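In code, the quantity described above is the feature_importances_
attribute, which scikit-learn computes on the training data at fit time,
with no OOB cases involved; a minimal illustration on an arbitrary dataset:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # MDI importances are derived from the impurity decreases observed
    # while fitting, weighted by the fraction of samples reaching each node.
    mdi = forest.feature_importances_
    print(np.argsort(mdi)[::-1][:5])  # indices of the top-5 features by MDI

Comparing this ranking with the permutation-based one sketched above is a
quick way to see how different the two definitions can be in practice
(question 2).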
>>> Thank you for reading this email. Please let me know your thoughts.
>>> Cheers,
>>> Faraz.
>>> Faraz Niyaghi
>>> Ph.D. Candidate, Department of Statistics
>>> Oregon State University
>>> Corvallis, OR
>
>
>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn