[scikit-learn] Random Forest with Bootstrapping

Ibrahim Dalal cs14btech11041 at iith.ac.in
Mon Oct 3 15:36:54 EDT 2016


Hi,

That helped a lot. Thank you very much. I have one more (silly?) doubt
though.

Won't an n-sized bootstrapped sample have repeated entries? Say we have an
original dataset of size 100. A bootstrap sample (say, B) of size 100 is
drawn from this set. Since roughly 37 of the original samples (0.368 * 100)
are left out, at least in expectation, some of the samples in B must be
repeated?
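
For instance, a quick sanity check (just a sketch, assuming numpy; drawing
100 indices with replacement) seems to confirm this:

import numpy as np

rng = np.random.RandomState(0)
original = np.arange(100)                          # dataset of size 100
B = rng.choice(original, size=100, replace=True)   # bootstrap sample

print(len(np.unique(B)))   # roughly 63 unique entries, so B has repeats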

On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Or, maybe more intuitively, you can visualize this asymptotic behavior,
> e.g., via:
>
> import matplotlib.pyplot as plt
>
> # evaluate 1 - (1 - 1/n)^n for increasing sample sizes n
> ns = list(range(5, 201, 5))
> vs = [1. - (1. - 1. / n)**n for n in ns]
>
> plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)
> plt.xlabel('n')
> plt.ylabel('1 - (1 - 1/n)^n')
> plt.xlim([0, 210])
> plt.show()
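>
> A quick simulation (just a sketch, assuming numpy) gives roughly the same
> number empirically:
>
> import numpy as np
>
> rng = np.random.RandomState(0)
> n = 1000
> # fraction of unique samples in an n-sized bootstrap sample,
> # averaged over 200 repetitions
> frac_unique = np.mean([len(np.unique(rng.randint(0, n, size=n))) / float(n)
>                        for _ in range(200)])
> print(frac_unique)   # approx. 0.632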
>
> > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
> >
> > The probability that a given sample from a dataset of size n is *not*
> > included in an n-sized bootstrap sample is
> >
> > P(not_chosen) = (1 - 1/n)^n
> >
> > since each of the n draws picks that particular sample with probability
> > 1/n (bootstrapping means drawing with replacement), so the sample is
> > missed n times in a row with probability (1 - 1/n)^n.
> >
> > This is asymptotically 1/e approx. 0.368 (i.e., for very, very large n).
> >
> > Then, you can compute the probability of a sample being chosen (at
> > least once) as
> >
> > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
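> >
> > For example, with n = 100: (1 - 1/100)^100 approx. 0.366, so P(chosen)
> > approx. 0.634, already close to the asymptotic 0.632.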
> >
> > Best,
> > Sebastian
> >
> >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn at python.org> wrote:
> >>
> >> Hi,
> >>
> >> Thank you for the reply. Please bear with me for a while.
> >>
> >> Where did this number, 0.632, come from? I have no background in
> >> statistics (which probably shows here!). Or let me rephrase my query:
> >> what is this bootstrap sampling all about? I searched the web but
> >> didn't find satisfactory results.
> >>
> >>
> >> Thanks
> >>
> >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <
> se.raschka at gmail.com> wrote:
> >>> From whatever little knowledge I gained last night about Random
> >>> Forests, each tree is trained with a sub-sample of the original
> >>> dataset (usually drawn with replacement)?
> >>
> >> Yes, that should be correct!
> >>
> >>> Now, what I am not able to understand is: if the entire dataset is
> >>> used to train each of the trees, then how does the classifier
> >>> estimate the OOB error? None of the entries of the dataset is
> >>> out-of-bag (OOB) for any of the trees. (Pardon me if all this
> >>> sounds BS.)
> >>
> >> If you take an n-sized bootstrap sample, where n is the number of
> >> samples in your dataset, you have asymptotically 0.632 * n unique
> >> samples in your bootstrap set. In other words, 0.368 * n samples are
> >> not used for growing the respective tree and are available for the OOB
> >> estimate. As far as I understand, the random forest OOB score is then
> >> computed by predicting each sample using only the trees for which it
> >> is out-of-bag and scoring these aggregated predictions (correct me if
> >> I am wrong!).
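> >>
> >> In scikit-learn you can request this estimate via oob_score=True; a
> >> minimal sketch (using the iris data purely for illustration):
> >>
> >> from sklearn.datasets import load_iris
> >> from sklearn.ensemble import RandomForestClassifier
> >>
> >> iris = load_iris()
> >>
> >> # bootstrap=True is the default; oob_score=True estimates the
> >> # generalization accuracy from the out-of-bag samples
> >> forest = RandomForestClassifier(n_estimators=100, oob_score=True,
> >>                                 random_state=0)
> >> forest.fit(iris.data, iris.target)
> >> print(forest.oob_score_)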
> >>
> >> Best,
> >> Sebastian
> >>
> >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn <
> scikit-learn at python.org> wrote:
> >>>
> >>> Dear Developers,
> >>>
> >>> From whatever little knowledge I gained last night about Random
> >>> Forests, each tree is trained with a sub-sample of the original
> >>> dataset (usually drawn with replacement)?
> >>>
> >>> (Note: Please do correct me if I am not making any sense.)
> >>>
> >>> RandomForestClassifier has a 'bootstrap' option. The API states the
> >>> following:
> >>>
> >>> The sub-sample size is always the same as the original input sample
> >>> size but the samples are drawn with replacement if bootstrap=True
> >>> (default).
> >>>
> >>> Now, what I am not able to understand is: if the entire dataset is
> >>> used to train each of the trees, then how does the classifier
> >>> estimate the OOB error? None of the entries of the dataset is
> >>> out-of-bag (OOB) for any of the trees. (Pardon me if all this
> >>> sounds BS.)
> >>>
> >>> Help this mere mortal.
> >>>
> >>> Thanks
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn