[scikit-learn] Random Forest max_features and bootstrap construction parameters interpretation
Brown J.B.
jbbrown at kuhp.kyoto-u.ac.jp
Tue Jun 6 01:02:21 EDT 2017
Dear Jacob,
Thank you for this clarification. It is a great help in interpreting the
(good) results that we are obtaining for computational chemogenomics, and
will also help in deciding the direction of future studies.
Perhaps then, the random forest documentation (description web page) could
be updated to reflect our discussion, in that it might help others who have
the same questions of interpretation.
Perhaps, we can add the following (with notation symbols corrected to match
sklearn standards):
----------
In general, for a modeling problem with N training instances each having F
features, a random forest of T trees operates by building T decision trees
such that each tree is provided a subsampling of N instances and the F
features for those subsampled instances.
When bootstrapping is applied, N instances are drawn with replacement, so the
same instance can be chosen multiple times, in which case that instance has
an elevated weight, while an instance that is never chosen has a weight of 0
for that tree.
When bootstrapping is not applied, the entire training set is provided to
the tree-building algorithm.
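
As an illustration of the bootstrapping paragraph, here is a minimal NumPy sketch of how drawing N indices with replacement turns into per-instance weights for one tree. It is only meant to show the idea, not to reproduce the library's internals; the size of 6000 and the seed are arbitrary choices for the example.

```python
import numpy as np

n_samples = 6000  # arbitrary illustrative training-set size
rng = np.random.default_rng(0)

# Draw n_samples row indices WITH replacement, as one tree would see them
# when bootstrapping is applied.
indices = rng.integers(0, n_samples, size=n_samples)

# How many times each original instance was drawn: its effective weight.
weights = np.bincount(indices, minlength=n_samples)

n_unique = int(np.count_nonzero(weights))
print("instances drawn at least once:", n_unique)
print("instances never drawn (weight 0):", n_samples - n_unique)
print("largest weight:", int(weights.max()))
# For large N the unique fraction is close to 1 - 1/e, roughly 0.632.
print("fraction of unique instances:", n_unique / n_samples)
```

Each tree would repeat this draw independently, so the set of zero-weight instances differs from tree to tree.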
Each tree is built by considering a maximum specified number of informative
features at each decision node, such that features with no variance are
excluded from the features to consider for a split and do not count toward
the number of informative features.
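
The point about zero-variance features can be checked with a small experiment, sketched below under the assumption of a toy dataset that I made up for the purpose: 50 constant columns plus a single informative column. With `max_features=1`, the split search should still reach the informative column, because constant columns do not use up the budget.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_samples, n_constant = 200, 50  # hypothetical toy sizes

# 50 constant (zero-variance) columns followed by one informative column.
informative = rng.normal(size=n_samples)
X = np.hstack([np.zeros((n_samples, n_constant)), informative[:, None]])
y = (informative > 0).astype(int)

# Only one informative feature is evaluated per split.
clf = DecisionTreeClassifier(max_features=1, random_state=0).fit(X, y)

# tree_.feature[0] is the index of the feature used at the root node.
root_feature = int(clf.tree_.feature[0])
train_accuracy = clf.score(X, y)
print("feature used at the root:", root_feature)
print("training accuracy:", train_accuracy)
```

If constant columns counted toward `max_features`, most random draws here would land on a useless column and the root could not be split.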
Splitting on the informative features can occur as many times as necessary,
unless a maximum depth is specified in the constructor.
Note that an informative feature can be re-applied to form a decision
criterion at more than one node in the decision tree.
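
To make the re-use of a feature concrete, the following sketch fits a single decision tree to the one-feature case raised in this thread (class 0 for x < 0, class 1 for 0 <= x <= 10, class 0 for x > 10). The grid of x values is my own choice for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature x on an evenly spaced grid from -10 to 20.
x = np.linspace(-10.0, 20.0, 61)
X = x.reshape(-1, 1)
y = ((x >= 0) & (x <= 10)).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# tree_.feature stores the feature index at decision nodes and a negative
# sentinel at leaves, so counting zeros counts the splits on feature 0.
n_splits_on_x = int(np.sum(clf.tree_.feature == 0))
train_accuracy = clf.score(X, y)
print("decision nodes using feature 0:", n_splits_on_x)
print("training accuracy:", train_accuracy)
```

A single threshold cannot separate the middle interval from both outer intervals, so the same feature has to appear at more than one node.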
----------
Adjustments welcome.
Many thanks again!
J.B.
2017-06-06 2:54 GMT+09:00 Jacob Schreiber <jmschreiber91 at gmail.com>:
> Howdy
>
> When doing bootstrapping, n samples are selected from the dataset WITH
> replacement, where n is the number of samples in the dataset. This leads to
> situations where some samples have a weight > 1 and others have a weight of
> 0. This is done separately for each tree.
>
> When selecting the number of features, this should be considered more like
> `max_informative_features`. Essentially, if a tree considers splitting on a
> feature that is constant, that won't count against the `max_features`
> threshold that is set. This helps guard against situations where many
> uninformative trees are built because the dataset is full of uninformative
> features. You can see this in the code here:
> https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361
> This is done on a per-split basis, meaning that a tree can have more than
> `max_features` features considered.
>
> In your example, it is not that there would be at most 20 splits in a
> tree, it is that at each split only 20 informative features would be
> considered. You can split on a feature multiple times (consider the example
> where you have one feature and x < 0 is class 0, 0 <= x <= 10 is class 1,
> and x > 10 is class 0 again).
>
> Let me know if you have any other questions!
>
> On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. <jbbrown at kuhp.kyoto-u.ac.jp>
> wrote:
>
>> Dear community,
>>
>> This is a question regarding how to interpret the documentation and
>> semantics of the random forest constructors.
>>
>> In forest.py (of version 0.17 which I am still using), the documentation
>> regarding the number of features to consider states on lines 742-745 of the
>> source code that the search may effectively inspect more than
>> `max_features` when determining the features to pick from in order to split
>> a node.
>> It also states that it is tree specific.
>>
>> Am I correct in:
>>
>> Interpretation #1 - For bootstrap=True, sampling with replacement occurs
>> for the number of training instances available, meaning that the subsample
>> presented to a particular tree will have some probability of containing
>> overlaps and therefore not the full input training set, but for
>> bootstrap=False, the entire dataset will be presented to each tree?
>>
>> Interpretation #2 - Particularly, with the way I interpret the
>> documentation stating that "The sub-sample size is always the same as the
>> original input sample size...", it seems to me that bootstrap=False then
>> provides the entire training dataset to each decision tree, and it is a
>> matter of which feature was randomly selected first from the features given
>> that determines what the tree will become.
>> That would suggest that, if bootstrap=False, and if the number of trees
>> is high but the feature dimensionality is very low, then there is a high
>> possibility that multiple copies of the same tree will emerge from the
>> forest.
>>
>> Interpretation #3 - the feature subset is not subsampled per tree, but
>> rather all features are presented for the subsampled training data provided
>> to a tree? For example, if the dimensionality is 400 on a 6000-input
>> training dataset that has randomly been subsampled (with bootstrap=True) to
>> yield 4700 unique training samples, then the tree builder will consider all
>> 400 dimensions/features with respect to the 4700 samples, picking at most
>> `max_features` number of features (out of 400) for building splits in the
>> tree? So by default (sqrt/auto), there would be at most 20 splits in the
>> tree?
>>
>> Confirmations, denials, and corrections to my interpretations are
>> _highly_ welcome.
>>
>> As always, my great thanks to the community.
>>
>> With kind regards,
>> J.B. Brown
>> Kyoto University Graduate School of Medicine
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>