[scikit-learn] Random Forest max_features and bootstrap construction parameters interpretation
Jacob Schreiber
jmschreiber91 at gmail.com
Tue Jun 6 01:26:15 EDT 2017
Howdy
This documentation seems to be split between the RandomForestClassifier
<http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier>
documentation that discusses sampling with replacement and the Ensemble
<http://scikit-learn.org/stable/modules/ensemble.html> documentation that
discusses the splits. I agree that it could be made more explicit. If you
want to build off of those pages so that each one covers both behaviors, I
would be happy to review it.
Jacob
On Mon, Jun 5, 2017 at 10:02 PM, Brown J.B. <jbbrown at kuhp.kyoto-u.ac.jp>
wrote:
> Dear Jacob,
>
> Thank you for this clarification. It is a great help in interpreting the
> (good) results that we are obtaining for computational chemogenomics, and
> it also helps in deciding the direction of future studies.
>
> Perhaps then, the random forest documentation (description web page) could
> be updated to reflect our discussion, as it might help others who have the
> same interpretation questions.
>
> Perhaps we can add the following (with notation symbols corrected to
> match sklearn standards):
> ----------
> In general, for a modeling problem with N training instances each having F
> features, a random forest of T trees operates by building T decision trees
> such that each tree is provided a subsample of the N instances and the F
> features for those subsampled instances.
>
> When bootstrapping is applied, the instance subsampling can potentially
> choose the same instance multiple times, in which case such an instance
> will receive elevated weight.
> When bootstrapping is not applied, the entire training set is provided to
> the tree-building algorithm.
>
> Each tree is built by considering a maximum specified number of
> informative features at each decision node, such that features with no
> variance are excluded from the features to consider for a split and do not
> count toward the number of informative features.
> Splitting on the informative features can occur as many times as
> necessary, unless a maximum depth is specified in the constructor.
> Note that an informative feature can be re-applied to form a decision
> criterion at more than one node in the decision tree.
> ----------
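>
> A minimal sketch of how the parameters described above map onto the
> RandomForestClassifier constructor (the values here are illustrative
> only, not recommendations):
>
> from sklearn.ensemble import RandomForestClassifier
>
> clf = RandomForestClassifier(
>     n_estimators=100,     # T trees
>     bootstrap=True,       # subsample the N instances with replacement per tree
>     max_features='sqrt',  # informative features considered at each split
>     max_depth=None,       # split as many times as necessary
> )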
>
> Adjustments welcome.
>
> Many thanks again!
> J.B.
>
>
>
> 2017-06-06 2:54 GMT+09:00 Jacob Schreiber <jmschreiber91 at gmail.com>:
>
>> Howdy
>>
>> When doing bootstrapping, n samples are selected from the dataset WITH
>> replacement, where n is the number of samples in the dataset. This leads to
>> situations where some samples have a weight > 1 and others have a weight of
>> 0. This is done separately for each tree.
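>>
>> As a quick illustration (not scikit-learn's exact code path, just the
>> idea), the per-tree resampling amounts to:
>>
>> import numpy as np
>>
>> n = 10  # number of samples in the dataset
>> rng = np.random.RandomState(0)
>> idx = rng.randint(0, n, size=n)          # draw n indices WITH replacement
>> weights = np.bincount(idx, minlength=n)  # per-sample weight for this tree
>> # some entries of `weights` are > 1, others are exactly 0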
>>
>> When selecting the number of features, this should be considered more
>> like `max_informative_features`. Essentially, if a tree considers splitting
>> on a feature that is constant, that won't count against the `max_features`
>> threshold that is set. This helps guard against situations where many
>> uninformative trees are built because the dataset is full of uninformative
>> features. You can see this in the code here:
>> https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361
>> This is done on a -per split basis-, meaning that a tree can end up
>> considering more than `max_features` features in total.
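>>
>> A rough sketch of that per-node sampling loop (heavily simplified from
>> _splitter.pyx; the helper name here is hypothetical):
>>
>> import numpy as np
>>
>> def sample_split_features(X, n_features, max_features, rng):
>>     """Draw candidate features until `max_features` informative
>>     (non-constant) ones have been found for this node."""
>>     informative = []
>>     for f in rng.permutation(n_features):
>>         if X[:, f].min() == X[:, f].max():
>>             continue                  # constant feature: does not count
>>         informative.append(f)
>>         if len(informative) == max_features:
>>             break
>>     return informative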
>>
>> In your example, it is not that there would be at most 20 splits in a
>> tree; it is that at each split only 20 informative features would be
>> considered. You can split on a feature multiple times (consider the example
>> where you have one feature and x < 0 is class 0, 0 <= x <= 10 is class 1,
>> and x > 10 is class 0 again).
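>>
>> That single-feature case is easy to verify directly. A minimal sketch;
>> the data values are made up to match the boundaries above, and
>> export_text requires a newer scikit-learn than the 0.17 discussed here:
>>
>> import numpy as np
>> from sklearn.tree import DecisionTreeClassifier, export_text
>>
>> X = np.array([[-5.0], [-1.0], [2.0], [8.0], [12.0], [20.0]])
>> y = np.array([0, 0, 1, 1, 0, 0])   # class 1 only for 0 <= x <= 10
>>
>> tree = DecisionTreeClassifier().fit(X, y)
>> print(export_text(tree))  # the one feature appears at two separate nodes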
>>
>> Let me know if you have any other questions!
>>
>> On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. <jbbrown at kuhp.kyoto-u.ac.jp>
>> wrote:
>>
>>> Dear community,
>>>
>>> This is a question regarding how to interpret the documentation and
>>> semantics of the random forest constructors.
>>>
>>> In forest.py (of version 0.17, which I am still using), the documentation
>>> regarding the number of features to consider states, on lines 742-745 of
>>> the source code, that the search may effectively inspect more than
>>> `max_features` when determining the features to pick from in order to split
>>> a node.
>>> It also states that this is tree-specific.
>>>
>>> Am I correct in:
>>>
>>> Interpretation #1 - For bootstrap=True, sampling with replacement occurs
>>> for the number of training instances available, meaning that the subsample
>>> presented to a particular tree will have some probability of containing
>>> overlaps and therefore not the full input training set, but for
>>> bootstrap=False, the entire dataset will be presented to each tree?
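>>>
>>> A quick numerical check of the overlap intuition (a sketch; the expected
>>> fraction of unique samples in a bootstrap is about 1 - 1/e, i.e. 0.632):
>>>
>>> import numpy as np
>>>
>>> n = 6000
>>> rng = np.random.RandomState(0)
>>> unique = np.unique(rng.randint(0, n, size=n)).size
>>> print(unique / n)   # ~0.632 of the samples appear at least once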
>>>
>>> Interpretation #2 - Particularly, given the way I interpret the
>>> documentation stating that "The sub-sample size is always the same as the
>>> original input sample size...", it seems to me that bootstrap=False then
>>> provides the entire training dataset to each decision tree, and it is the
>>> random order in which features are selected from those provided that
>>> determines what the tree will become.
>>> That would suggest that, if bootstrap=False, and if the number of trees
>>> is high but the feature dimensionality is very low, then there is a high
>>> possibility that multiple copies of the same tree will emerge from the
>>> forest.
>>>
>>> Interpretation #3 - the feature subset is not subsampled per tree, but
>>> rather all features are presented for the subsampled training data provided
>>> to a tree? For example, if the dimensionality is 400 on a 6000-sample
>>> training dataset that has randomly been subsampled (with bootstrap=True) to
>>> yield 4700 unique training samples, then the tree builder will consider all
>>> 400 dimensions/features with respect to the 4700 samples, picking at most
>>> `max_features` features (out of 400) for building splits in the tree? So by
>>> default (sqrt/auto), there would be at most 20 splits in the tree?
>>>
>>> Confirmations, denials, and corrections to my interpretations are
>>> _highly_ welcome.
>>>
>>> As always, my great thanks to the community.
>>>
>>> With kind regards,
>>> J.B. Brown
>>> Kyoto University Graduate School of Medicine
>>>