[scikit-learn] Bootstrapping in sklearn

Mon Sep 17 20:26:37 EDT 2018

Hi all,

As everyone knows, sklearn is excellent for building predictive models, but
one area where I believe there is still work to be done is in measuring the
inherent uncertainty in those models.  (That there is an appetite for this
is, I believe, evidenced by the rise in popularity of probabilistic
programming.)  We can, for example, easily find point estimates for the
coefficients of linear models in sklearn, but making inferences from those
point estimates is not possible without some measure of their probable
error.
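
To make the gap concrete, here is a minimal sketch of a nonparametric
bootstrap for the coefficients of a linear model, using only what sklearn
and numpy already provide.  The data are synthetic, and the helper name
bootstrap_coefs and its parameters are my own choices for illustration;
this is not the API of any package.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample


def bootstrap_coefs(X, y, n_boot=500, seed=0):
    """Refit a linear model on n_boot bootstrap resamples of the rows of (X, y).

    Hypothetical helper for illustration only. Returns an array of shape
    (n_boot, n_features) holding the fitted coefficients of each refit.
    """
    rng = np.random.RandomState(seed)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        # Draw rows with replacement, keeping X and y aligned.
        Xb, yb = resample(X, y, replace=True, random_state=rng)
        coefs[b] = LinearRegression().fit(Xb, yb).coef_
    return coefs


# Synthetic data (an assumption for the example): true coefficients 3 and -1.
data_rng = np.random.RandomState(42)
X = data_rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + data_rng.normal(size=200)

coefs = bootstrap_coefs(X, y)

# Bootstrap standard errors and 95% percentile intervals, one per coefficient.
std_err = coefs.std(axis=0, ddof=1)
ci_low, ci_high = np.percentile(coefs, [2.5, 97.5], axis=0)
```

The percentile interval is the simplest choice and is used here only
because it needs nothing beyond np.percentile; other interval
constructions may behave better in small samples.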

To address this and other problems I authored a package called resample,
which implements the bootstrap and other randomization-based procedures
with the goal of performing largely nonparametric statistical inference on
a wide class of problems.  The package is built entirely on numpy and scipy
and so already integrates fairly well with sklearn.  There is a tutorial
which, among other things, shows applications using the Boston housing
data:

https://github.com/dsaxton/resample/blob/master/doc/resample.ipynb

Might there be interest in including something like this as a
scikit-learn-contrib package?  Essentially we are taking what is already in
sklearn.utils.resample and extending it to include other forms of the
bootstrap (e.g., balanced, parametric, stratified and/or smoothed),
algorithms for computing automatic confidence intervals, and procedures for
doing nonparametric, randomization-based hypothesis testing.
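
As an illustration of what is meant by randomization-based hypothesis
testing, here is a minimal sketch of a two-sample permutation test for a
difference in means, written in plain numpy.  The function name, its
parameters and the synthetic data are assumptions made for the example and
do not reflect the interface of the resample package.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Hypothetical helper for illustration only. Returns the observed
    difference mean(a) - mean(b) and a Monte Carlo p-value.
    """
    rng = np.random.RandomState(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    n_a = len(a)

    count = 0
    for _ in range(n_perm):
        # Under the null the group labels are exchangeable, so shuffle them.
        perm = rng.permutation(pooled)
        diff = perm[:n_a].mean() - perm[n_a:].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add one to numerator and denominator so the p-value is never exactly 0.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value


# Synthetic data (an assumption for the example): second group shifted by 1.5.
data_rng = np.random.RandomState(1)
group_a = data_rng.normal(loc=0.0, scale=1.0, size=50)
group_b = data_rng.normal(loc=1.5, scale=1.0, size=50)

observed, p_value = permutation_test_mean_diff(group_a, group_b)
```

No distributional assumption is made beyond exchangeability of the labels
under the null, which is the sense in which the procedure is nonparametric.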

Here is the GitHub page:

https://github.com/dsaxton/resample

Of course, I also would greatly appreciate any input that others might have
on ways that this package could be made more useful.

Thanks,
Daniel