[scikit-learn] Random Forest max_features and bootstrap construction parameters interpretation

Mon Jun 5 13:54:57 EDT 2017

Howdy

When doing bootstrapping, n samples are selected from the dataset WITH
replacement, where n is the number of samples in the dataset. As a result,
some samples are drawn more than once (weight > 1) and others are not drawn
at all (weight 0). This is done separately for each tree.
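
To make the weights concrete, here is a minimal sketch of one bootstrap draw
using only the Python standard library. It is an illustration, not
scikit-learn's code; `bootstrap_weights` is a name made up for this example,
and 6000 is just the dataset size from the question.

```python
import random
from collections import Counter

def bootstrap_weights(n_samples, seed=0):
    """Draw n_samples indices WITH replacement; return how often each was drawn."""
    rng = random.Random(seed)
    drawn = rng.choices(range(n_samples), k=n_samples)
    counts = Counter(drawn)
    # A weight of 0 means the sample is left out of this tree's training set.
    return [counts.get(i, 0) for i in range(n_samples)]

weights = bootstrap_weights(6000, seed=0)
n_unique = sum(1 for w in weights if w > 0)

print(sum(weights))   # total weight always equals n_samples
print(n_unique)       # number of distinct samples this tree would see
print(max(weights))   # some samples are drawn more than once
```

For large n, the expected fraction of distinct samples in one draw is about
1 - 1/e, or roughly 63%. Each tree would call this with its own seed.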

When selecting the number of features, `max_features` is better thought of
as `max_informative_features`. Essentially, if a tree considers splitting on
a feature that is constant, that won't count against the `max_features`
threshold that is set. This helps guard against situations where many
uninformative trees are built because the dataset is full of uninformative
features. You can see this in the code here: (
https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361).
This is done on a per-split basis, meaning that a tree as a whole can
consider more than `max_features` features.
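
The per-split behaviour can be sketched roughly as follows. This is a
simplified pure-Python illustration, not the Cython splitter linked above;
`candidate_features` and `X_node` are hypothetical names, and the real
implementation has further details (for example, how it tracks constant
features) that are left out here.

```python
import random

def candidate_features(X_node, max_features, rng):
    """Visit features in random order; stop after max_features non-constant ones."""
    n_features = len(X_node[0])
    order = list(range(n_features))
    rng.shuffle(order)
    visited, informative = [], []
    for f in order:
        visited.append(f)
        column = [row[f] for row in X_node]
        if min(column) == max(column):
            # Constant for the samples at this node: inspected, but not counted.
            continue
        informative.append(f)
        if len(informative) == max_features:
            break
    return visited, informative

# Samples reaching one node: only features 1 and 4 vary, the rest are constant.
X_node = [
    [1.0, 0.2, 3.0, 7.0, 5.1, 0.0],
    [1.0, 0.9, 3.0, 7.0, 4.8, 0.0],
    [1.0, 0.4, 3.0, 7.0, 6.3, 0.0],
    [1.0, 0.7, 3.0, 7.0, 5.9, 0.0],
]
rng = random.Random(0)
visited, informative = candidate_features(X_node, max_features=2, rng=rng)
print(len(visited), sorted(informative))
```

The number of features visited can exceed `max_features`, while the number
of informative candidates does not. Since the draw is repeated at every
split, different splits in the same tree can end up with different
candidates.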

In your example, it is not that there would be at most 20 splits in a tree;
rather, at each split only 20 informative features would be considered.
You can split on a feature multiple times (consider the example where you
have one feature and x < 0 is class 0, 0 <= x <= 10 is class 1, and x > 10
is class 0 again).
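
Here is a small sketch of that one-feature example. It grows a greedy tree
by hand with Gini impurity, in pure Python, rather than calling
scikit-learn; the helper names and the integer grid of x values are
assumptions made for the illustration.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)

def best_threshold(xs, ys):
    """Return the threshold that most reduces weighted Gini, or None if none does."""
    pairs = sorted(zip(xs, ys))
    best, best_score = None, gini(ys)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score - 1e-12:
            best, best_score = (pairs[i - 1][0] + pairs[i][0]) / 2.0, score
    return best

def thresholds(xs, ys):
    """Collect, in increasing order, every threshold the greedy tree uses on x."""
    t = best_threshold(xs, ys)
    if t is None:
        return []  # node is pure, so it becomes a leaf
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return thresholds(*zip(*left)) + [t] + thresholds(*zip(*right))

xs = list(range(-5, 16))
ys = [1 if 0 <= x <= 10 else 0 for x in xs]
splits = thresholds(xs, ys)
print(splits)  # [-0.5, 10.5]
```

The single feature is split on twice, once near 0 and once near 10, which
is why `max_features` says nothing about the number of splits in a tree.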

Let me know if you have any other questions!

On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. <jbbrown at kuhp.kyoto-u.ac.jp>
wrote:

> Dear community,
>
> This is a question regarding how to interpret the documentation and
> semantics of the random forest constructors.
>
> In forest.py (of version 0.17 which I am still using), the documentation
> regarding the number of features to consider states on lines 742-745 of the
> source code that the search may effectively inspect more than
> `max_features` when determining the features to pick from in order to split
> a node.
> It also states that it is tree specific.
>
> Am I correct in:
>
> Interpretation #1 - For bootstrap=True, sampling with replacement occurs
> for the number of training instances available, meaning that the subsample
> presented to a particular tree will have some probability of containing
> overlaps and therefore not the full input training set, but for
> bootstrap=False, the entire dataset will be presented to each tree?
>
> Interpretation #2 - Particularly, with the way I interpret the
> documentation stating that "The sub-sample size is always the same as the
> original input sample size...", it seems to me that bootstrap=False then
> provides the entire training dataset to each decision tree, and it is a
> matter of which feature was randomly selected first from the features given
> that determines what the tree will become.
> That would suggest that, if bootstrap=False, and if the number of trees is
> high but the feature dimensionality is very low, then there is a high
> possibility that multiple copies of the same tree will emerge from the
> forest.
>
> Interpretation #3 - the feature subset is not subsampled per tree, but
> rather all features are presented for the subsampled training data provided
> to a tree?  For example, if the dimensionality is 400 on a 6000-input
> training dataset that has randomly been subsampled (with bootstrap=True) to
> yield 4700 unique training samples, then the tree builder will consider all
> 400 dimensions/features with respect to the 4700 samples, picking at most
> `max_features` number of features (out of 400) for building splits in the
> tree?  So by default (sqrt/auto), there would be at most 20 splits in the
> tree?
>
> Confirmations, denials, and corrections to my interpretations are _highly_
> welcome.
>
> As always, my great thanks to the community.
>
> With kind regards,
> J.B. Brown
> Kyoto University Graduate School of Medicine