[scikit-learn] Random Forest with Bootstrapping

Ibrahim Dalal cs14btech11041 at iith.ac.in
Tue Oct 4 06:44:06 EDT 2016


Hi,

So why is using a bootstrap sample of size n better than just a random subset
of size 0.632*n in a Random Forest?

Thanks

On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Originally, this technique was used to estimate a sampling distribution.
> Think of the drawing with replacement as a work-around for generating *new*
> data from a population, where the population is simulated by repeatedly
> sampling from the given dataset with replacement.
>
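> For intuition, here is a minimal sketch (just an illustration with made-up
> data; assumes NumPy is available) of how bootstrapping approximates the
> sampling distribution of a statistic such as the mean:
>
> import numpy as np
>
> rng = np.random.RandomState(0)
> data = rng.normal(loc=5., scale=2., size=100)  # the one dataset we actually have
>
> # draw many bootstrap samples (with replacement, same size as the data)
> boot_means = [rng.choice(data, size=data.shape[0], replace=True).mean()
>               for _ in range(1000)]
>
> # the spread of these means approximates the sampling distribution of the mean
> print(np.mean(boot_means), np.std(boot_means))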
>
> For more details, I’d recommend reading the original literature, e.g.,
>
> Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.”
> The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1–26.
>
>
> There’s also a whole book on this topic:
>
> Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the
> Bootstrap. Chapman & Hall.
>
>
> Or more relevant to this particular application, maybe see
>
> Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.
>
> "Tests on real and simulated data sets using classification and regression
> trees and subset selection in linear regression show that bagging can give
> substantial gains in accuracy. The vital element is the instability of the
> prediction method. If perturbing the learning set can cause significant
> changes in the predictor constructed, then bagging can improve accuracy."
>
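> In scikit-learn terms, bagging an unstable learner such as a single decision
> tree could look roughly like this (a sketch; the dataset and parameter values
> are arbitrary and just for illustration):
>
> from sklearn.datasets import load_iris
> from sklearn.ensemble import BaggingClassifier
> from sklearn.model_selection import cross_val_score
> from sklearn.tree import DecisionTreeClassifier
>
> X, y = load_iris(return_X_y=True)
> tree = DecisionTreeClassifier(random_state=0)
> bag = BaggingClassifier(tree, n_estimators=50, bootstrap=True, random_state=0)
>
> # compare a single tree vs. the bagged ensemble via cross-validation
> print(cross_val_score(tree, X, y, cv=5).mean())
> print(cross_val_score(bag, X, y, cv=5).mean())
>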
>
> > On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn at python.org> wrote:
> >
> > > So what is the point of having duplicate entries in your training set?
> This seems like pure overhead. Sorry, but you will again have to help me
> here.
> >
> > On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
> > > Hi,
> > >
> > > That helped a lot. Thank you very much. I have one more (silly?) doubt
> though.
> > >
> > > Won't an n-sized bootstrapped sample have repeated entries? Say we
> have an original dataset of size 100. A bootstrap sample (say, B) of size
> 100 is drawn from this set. Since about 37 of the original samples are left
> out (in expectation, at least), some of the samples in B must be repeated?
> >
> > Yeah, you'll definitely have duplicates; that’s why (asymptotically, for an
> infinitely large n) only about 0.632*n of the samples are unique ;).
> >
> > Say your dataset is
> >
> > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices
> of your data points)
> >
> > then a bootstrap sample could be
> >
> > [9, 1, 1, 0, 4, 4, 5, 7, 9, 9] and the left-out samples are consequently
> [2, 3, 6, 8]
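> >
> > In code, a rough sketch of this using NumPy (not necessarily how
> > scikit-learn implements it internally) could be:
> >
> > import numpy as np
> >
> > rng = np.random.RandomState(123)
> > n = 10
> > boot_idx = rng.randint(0, n, size=n)             # indices drawn with replacement
> > oob_idx = np.setdiff1d(np.arange(n), boot_idx)   # indices never drawn = out-of-bag
> > print(boot_idx, oob_idx)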
> >
> >
> > > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn at python.org> wrote:
> > >
> > > Hi,
> > >
> > > That helped a lot. Thank you very much. I have one more (silly?) doubt
> though.
> > >
> > > Won't an n-sized bootstrapped sample have repeated entries? Say we
> have an original dataset of size 100. A bootstrap sample (say, B) of size
> 100 is drawn from this set. Since about 37 of the original samples are left
> out (in expectation, at least), some of the samples in B must be repeated?
> > >
> > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <
> se.raschka at gmail.com> wrote:
> > > Or maybe more intuitively, you can visualize this asymptotic behavior
> e.g., via
> > >
> > > import matplotlib.pyplot as plt
> > >
> > > # P(a given sample is drawn at least once) = 1 - (1 - 1/n)^n,
> > > # which approaches 1 - 1/e ~ 0.632 as n grows
> > > ns = list(range(5, 201, 5))
> > > vs = [1. - (1. - 1./n)**n for n in ns]
> > >
> > > plt.plot(ns, vs, marker='o',
> > >          markersize=6,
> > >          alpha=0.5)
> > >
> > > plt.xlabel('n')
> > > plt.ylabel('1 - (1 - 1/n)^n')
> > > plt.xlim([0, 210])
> > > plt.show()
> > >
> > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
> > > >
> > > > Say the probability that a given sample from a dataset of size n is
> *not* drawn as a bootstrap sample is
> > > >
> > > > P(not_chosen) = (1 - 1/n)^n
> > > >
> > > > This follows because each draw has a 1/n chance of picking a particular
> sample (bootstrapping draws with replacement), i.e., a (1 - 1/n) chance of
> missing it, and the draw is repeated n times to get an n-sized bootstrap
> sample.
> > > >
> > > > This is asymptotically "1/e approx. 0.368” (i.e., for very, very
> large n)
> > > >
> > > > Then, you can compute the probability of a sample being chosen as
> > > >
> > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
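> > > >
> > > > For example, with n = 100 this gives 1 - 0.99^100 approx. 0.634, which is
> > > > already close to the 0.632 limit.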
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn at python.org> wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> Thank you for the reply. Please bear with me for a while.
> > > >>
> > > >> Where did this number, 0.632, come from? I have no background in
> statistics (which appears to be the case here!). Or let me rephrase my
> query: what is this bootstrap sampling all about? I searched the web, but
> didn't get satisfactory results.
> > > >>
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <
> se.raschka at gmail.com> wrote:
> > > >>> From whatever little knowledge I gained last night about Random
> Forests, each tree is trained with a sub-sample of the original dataset
> (usually drawn with replacement)?
> > > >>
> > > >> Yes, that should be correct!
> > > >>
> > > >>> Now, what I am not able to understand is: if the entire dataset is
> used to train each of the trees, then how does the classifier estimate the
> OOB error? None of the entries of the dataset is out-of-bag for any of the
> trees. (Pardon me if all this sounds BS.)
> > > >>
> > > >> If you take an n-sized bootstrap sample, where n is the number of
> samples in your dataset, you asymptotically have 0.632 * n unique samples
> in your bootstrap set. In other words, 0.368 * n samples are not used for
> growing the respective tree and can be used to compute the OOB error. As far
> as I understand, the random forest OOB score is then computed as the average
> OOB of each tree (correct me if I am wrong!).
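> > > >>
> > > >> For example, a quick sketch (the iris data here is just for illustration):
> > > >>
> > > >> from sklearn.datasets import load_iris
> > > >> from sklearn.ensemble import RandomForestClassifier
> > > >>
> > > >> X, y = load_iris(return_X_y=True)
> > > >> clf = RandomForestClassifier(n_estimators=100, bootstrap=True,
> > > >>                              oob_score=True, random_state=0)
> > > >> clf.fit(X, y)
> > > >> print(clf.oob_score_)  # accuracy estimated on the out-of-bag samples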
> > > >>
> > > >> Best,
> > > >> Sebastian
> > > >>
> > > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn at python.org> wrote:
> > > >>>
> > > >>> Dear Developers,
> > > >>>
> > > >>> From whatever little knowledge I gained last night about Random
> Forests, each tree is trained with a sub-sample of the original dataset
> (usually drawn with replacement)?
> > > >>>
> > > >>> (Note: Please do correct me if I am not making any sense.)
> > > >>>
> > > >>> RandomForestClassifier has a 'bootstrap' option. The API
> states the following:
> > > >>>
> > > >>> The sub-sample size is always the same as the original input
> sample size but the samples are drawn with replacement if bootstrap=True
> (default).
> > > >>>
> > > >>> Now, what I am not able to understand is: if the entire dataset is
> used to train each of the trees, then how does the classifier estimate the
> OOB error? None of the entries of the dataset is out-of-bag for any of the
> trees. (Pardon me if all this sounds BS.)
> > > >>>
> > > >>> Help this mere mortal.
> > > >>>
> > > >>> Thanks