[scikit-learn] Random Forest with Bootstrapping

Sebastian Raschka se.raschka at gmail.com
Mon Oct 3 14:32:52 EDT 2016


> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually with replacement)?

Yes, that should be correct!

> Now, what I am not able to understand is - if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is OOB for any of the trees. (Pardon me if all this sounds BS)

If you take an n-size bootstrap sample, where n is the number of samples in your dataset, you have asymptotically 0.632 * n unique samples in your bootstrap set. In other words, about 0.368 * n samples are not used for growing the respective tree, and these out-of-bag samples are the ones available for computing that tree's OOB error. As far as I understand, the random forest OOB score is then computed as the average OOB of each tree (correct me if I am wrong!).
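
The 0.632 figure comes from the chance that a given sample is drawn at least once in n draws with replacement: 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows. Here is a minimal sketch to check it by simulation, using only the standard library. The helper name `bootstrap_fractions` and the choice of n = 10,000 are just assumptions for illustration, not anything from scikit-learn:

```python
import math
import random


def bootstrap_fractions(n, seed=0):
    """Draw one bootstrap sample of size n (with replacement) and
    return the (in-bag unique, out-of-bag) fractions of the dataset."""
    rng = random.Random(seed)
    # Indices that appear at least once in the bootstrap sample.
    in_bag_indices = {rng.randrange(n) for _ in range(n)}
    unique_fraction = len(in_bag_indices) / n
    return unique_fraction, 1.0 - unique_fraction


n = 10_000
in_bag, oob = bootstrap_fractions(n)
limit = 1.0 - math.exp(-1.0)  # asymptotic in-bag fraction, roughly 0.632

print(f"in-bag unique fraction: {in_bag:.3f}")
print(f"out-of-bag fraction:    {oob:.3f}")
print(f"asymptotic limit:       {limit:.3f}")
```

So even though every tree gets a sample of size n, roughly a third of the dataset is left out of each tree. In scikit-learn, passing oob_score=True to RandomForestClassifier (with bootstrap=True) makes the estimate available as the oob_score_ attribute after fitting.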

Best,
Sebastian

> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn <scikit-learn at python.org> wrote:
> 
> Dear Developers,
> 
> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually with replacement)?
> 
> (Note: Please do correct me if I am not making any sense.)
> 
> RandomForestClassifier has an option of 'bootstrap'. The API states the following
>  
> The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
> 
> Now, what I am not able to understand is - if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is OOB for any of the trees. (Pardon me if all this sounds BS)
> 
> Help this mere mortal.
> 
> Thanks
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


