[scikit-learn] Random Forest with Bootstrapping

Sebastian Raschka se.raschka at gmail.com
Mon Oct 3 16:28:36 EDT 2016


Originally, this technique was used to estimate a sampling distribution. Think of the drawing with replacement as a work-around for generating *new* data from a population, which is simulated by repeatedly sampling from the given dataset with replacement.
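
For instance, here is a minimal sketch (plain NumPy; the data and the statistic are made up purely for illustration) of using the bootstrap to approximate the sampling distribution of the mean:

import numpy as np

rng = np.random.RandomState(123)

# pretend this is the one dataset we actually observed
data = rng.normal(loc=5.0, scale=2.0, size=100)

# draw many bootstrap samples (with replacement) and record the statistic
n_bootstrap = 1000
boot_means = np.empty(n_bootstrap)
for i in range(n_bootstrap):
    sample = rng.choice(data, size=data.shape[0], replace=True)
    boot_means[i] = sample.mean()

# the spread of boot_means approximates the sampling distribution of the mean
print(boot_means.mean(), boot_means.std())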


For more details, I’d recommend reading the original literature, e.g.,

Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1–26. 


There’s also a whole book on this topic:

Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.


Or, more relevant to this particular application, maybe see

Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.

"Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."


> On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn <scikit-learn at python.org> wrote:
> 
> So what is the point of having duplicate entries in your training set? This seems like pure overhead. Sorry, but you will again have to help me here.
> 
> On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > Hi,
> >
> > That helped a lot. Thank you very much. I have one more (silly?) doubt though.
> >
> > Won't an n-sized bootstrap sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since roughly 37 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?
> 
> Yeah, you'll definitely have duplicates; that's why (if you have an infinitely large n) only 0.632*n of the samples are unique ;).
> 
> Say your dataset is
> 
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices of your data points)
> 
> then a bootstrap sample could be
> 
> [9, 1, 1, 0, 4, 4, 5, 7, 9, 9] and your left out sample is consequently [2, 3, 6, 8]
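> 
> A quick way to play with this in code (plain NumPy; just a sketch of the sampling itself, not what RandomForestClassifier does internally):
> 
> import numpy as np
> 
> rng = np.random.RandomState(0)
> indices = np.arange(10)
> 
> # n-sized bootstrap sample of indices, drawn with replacement
> boot = rng.choice(indices, size=indices.shape[0], replace=True)
> 
> # everything that was never drawn is "out of bag"
> oob = np.setdiff1d(indices, boot)
> 
> print(boot)
> print(oob)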
> 
> 
> > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn <scikit-learn at python.org> wrote:
> >
> > Hi,
> >
> > That helped a lot. Thank you very much. I have one more (silly?) doubt though.
> >
> > Won't an n-sized bootstrap sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since roughly 37 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?
> >
> > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > Or maybe more intuitively, you can visualize this asymptotic behavior, e.g., via
> >
> > import matplotlib.pyplot as plt
> >
> > # probability that a particular sample appears at least once
> > # in an n-sized bootstrap sample, for increasing n
> > ns = list(range(5, 201, 5))
> > vs = [1. - (1. - 1./n)**n for n in ns]
> >
> > plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)
> > plt.xlabel('n')
> > plt.ylabel('1 - (1 - 1/n)^n')
> > plt.xlim([0, 210])
> > plt.show()
> >
> > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > >
> > > Say the probability that a given sample from a dataset of size n is *not* drawn into an n-sized bootstrap sample is
> > >
> > > P(not_chosen) = (1 - 1/n)^n
> > >
> > > since each of the n draws (bootstrapping draws with replacement) picks that particular sample with probability 1/n, i.e., misses it with probability 1 - 1/n, and the draw is repeated n times to build the n-sized bootstrap sample.
> > >
> > > Asymptotically (i.e., for very, very large n), this is 1/e approx. 0.368.
> > >
> > > Then, you can compute the probability of a sample being chosen as
> > >
> > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
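> > >
> > > (Quick sanity check with concrete numbers: for n = 100, (1 - 1/100)^100 approx. 0.366, which is already close to 1/e approx. 0.368, so each original sample has roughly a 63.4% chance of appearing in the bootstrap sample at least once.)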
> > >
> > > Best,
> > > Sebastian
> > >
> > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn <scikit-learn at python.org> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Thank you for the reply. Please bear with me for a while.
> > >>
> > >> Where did this number, 0.632, come from? I have no background in statistics (which probably shows here!). Or let me rephrase my query: what is this bootstrap sampling all about? I searched the web but didn't get satisfactory results.
> > >>
> > >>
> > >> Thanks
> > >>
> > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > >>> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually drawn with replacement)?
> > >>
> > >> Yes, that should be correct!
> > >>
> > >>> Now, what I am not able to understand is: if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is out-of-bag for any of the trees. (Pardon me if all this sounds BS)
> > >>
> > >> If you take an n-sized bootstrap sample, where n is the number of samples in your dataset, you have asymptotically 0.632 * n unique samples in your bootstrap set. Or, in other words, 0.368 * n samples are not used for growing the respective tree (and those can be used to compute the OOB error). As far as I understand, the random forest OOB score is then computed as the average OOB of each tree (correct me if I am wrong!).
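> > >>
> > >> If you just want the number, here is a minimal sketch (toy data from make_classification; bootstrap=True is the default anyway) of requesting that OOB estimate from scikit-learn:
> > >>
> > >> from sklearn.datasets import make_classification
> > >> from sklearn.ensemble import RandomForestClassifier
> > >>
> > >> X, y = make_classification(n_samples=500, random_state=0)
> > >>
> > >> # oob_score=True asks the forest to estimate its accuracy from the
> > >> # samples each tree did not see during fitting
> > >> forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
> > >>                                 oob_score=True, random_state=0)
> > >> forest.fit(X, y)
> > >> print(forest.oob_score_)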
> > >>
> > >> Best,
> > >> Sebastian
> > >>
> > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn <scikit-learn at python.org> wrote:
> > >>>
> > >>> Dear Developers,
> > >>>
> > >>> From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually drawn with replacement)?
> > >>>
> > >>> (Note: Please do correct me if I am not making any sense.)
> > >>>
> > >>> RandomForestClassifier has a 'bootstrap' option. The API docs state the following:
> > >>>
> > >>> The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
> > >>>
> > >>> Now, what I am not able to understand is: if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is out-of-bag for any of the trees. (Pardon me if all this sounds BS)
> > >>>
> > >>> Help this mere mortal.
> > >>>
> > >>> Thanks