Questions/suggestions for bootstrap docs

Hi all,

I was trying to use the scipy.stats.bootstrap function and wasn't sure that I was understanding the format of the `data` argument. Happy to make a PR if I'm understanding things correctly and can improve the docs.

The docs say:

> Each element of data is a sample from an underlying distribution.

I think I'm confused about what the elements of `data` mean versus the dimension of the arrays that make up the iterable `data`. It's also not clear which things are assumed to be a "sample from an underlying distribution", e.g., each element of `data` or the elements of the arrays in `data`.

Based on the examples, it seems like bootstrap can be used in these scenarios:

- A 1d array of data, x, with a statistic that takes a single argument like `np.mean`. In this case, data=(x,)
- A set of (paired) 1d arrays of data, (x, y), with a statistic that takes multiple arguments, like `pearsonr`. In this case, data=(x, y)

I think this scenario is possible based on the examples, but I wasn't sure based on the docs:

- An Nd array of data, x, with a statistic that you want to compute along an axis, and return bootstrap statistics for each element of the other dimensions. For instance, if you want to bootstrap multiple datasets at once. In this case, data=(x,) with x.ndim > 1.

I'm not clear if bootstrap can be used in this scenario:

- `statistic` is implicitly computed along multiple axes, and requires a vector of features per sample (e.g., R^2 from fitting a linear model from bootstrapped multivariate samples)

*Is my understanding of these things correct?*

- len(data) should be equal to the number of args required by `statistic`? Is there any other reason to have len(data) > 1?
- Assuming `statistic` can be computed along an axis, it has to return an array of statistics with exactly 1 fewer dimensions than the input arrays. For instance, if x is 2d, np.mean(x, axis=-1) returns a 1d array of means and is allowed, but multiclass logistic regression would take in 2d arrays (x, y) but return a scalar (accuracy) which would not be allowed.

Thanks for any help,
Jesse

--
Jesse Livezey
he/him/his
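
For concreteness, here is a rough sketch of how the first three scenarios in the message above might be written. It is only an illustration under assumptions: the data are synthetic, `pearson_r` is a hypothetical helper that is not part of SciPy, and `method="percentile"` with a small `n_resamples` is chosen only to keep the example simple and quick.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # only used to make synthetic data

# Scenario 1: one 1-D sample, one-argument statistic -> data=(x,)
x = rng.normal(size=100)
res_mean = stats.bootstrap((x,), np.mean, n_resamples=999,
                           vectorized=True, method="percentile")
ci_mean = res_mean.confidence_interval

# Scenario 2: paired 1-D samples, two-argument statistic -> data=(x, y)
y = x + rng.normal(scale=0.5, size=100)

def pearson_r(a, b):
    # Hypothetical helper: keep only the correlation coefficient.
    return stats.pearsonr(a, b)[0]

res_r = stats.bootstrap((x, y), pearson_r, n_resamples=999,
                        paired=True, vectorized=False, method="percentile")
ci_r = res_r.confidence_interval

# Scenario 3: several datasets at once, stacked along the first axis.
# The statistic is taken along the last axis, so the result has one
# fewer dimension than the input: one interval per row.
x2d = rng.normal(size=(5, 100))
res_nd = stats.bootstrap((x2d,), np.mean, n_resamples=999, axis=-1,
                         vectorized=True, method="percentile")
ci_nd = res_nd.confidence_interval

print(ci_mean.low, ci_mean.high)
print(ci_r.low, ci_r.high)
print(ci_nd.low.shape, ci_nd.high.shape)
```

In each call, `len(data)` matches the number of positional arguments the statistic accepts, which is the reading of the docs asked about above.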

Hi Jesse,

> Based on the examples, it seems like bootstrap can be used in these scenarios

There are many other scenarios. It can be used with unpaired multi-sample statistics like an unpaired t-test. It can be used with n-sample statistics, paired and unpaired.

> I think this scenario is possible based on the examples, but I wasn't sure based on the docs

Yes. If your statistic function accepts only one argument, and you have an N dimensional `x`, then you can pass in `data = (x,)` and use `axis` to specify the dimension the statistic is to be taken along.

> I'm not clear if bootstrap can be used in this scenario

Maybe. `bootstrap` does not currently support tuple `axis`, but could you ravel those dimensions of your array before passing them in?

> len(data) should be equal to the number of args required by `statistic`?

Yes.

> Is there any other reason to have len(data) > 1?

If statistic accepts more than one argument, then `len(data) > 1`. Otherwise, no.

> Assuming `statistic` can be computed along an axis, it has to return an array of statistics with exactly 1 fewer dimensions than the input arrays. For instance, if x is 2d, np.mean(x, axis=-1) returns a 1d array of means and is allowed, but multiclass logistic regression would take in 2d arrays (x, y) but return a scalar (accuracy) which would not be allowed.

I think what you've written is exactly correct. Depending on how the regression works (I'm not familiar with it), you might be able to get around this by raveling the dimensions of `x` and `y` first, then turning it back into an array inside `statistic` before doing the regression. But then the resampling will treat all the elements of the array as interchangeable, so I'm not sure if that will work for you.

Could you describe how you'd like bootstrap to resample the data in a 2D array? We've talked about adding block resampling methods, and that might be adapted to work for you.

On Fri, Oct 15, 2021 at 11:58 AM Jesse Livezey <jesse.livezey@gmail.com> wrote:
_______________________________________________
SciPy-Dev mailing list -- scipy-dev@python.org
To unsubscribe send an email to scipy-dev-leave@python.org
https://mail.python.org/mailman3/lists/scipy-dev.python.org/
Member address: haberland@ucla.edu
--
Matt Haberland
Assistant Professor
BioResource and Agricultural Engineering
08A-3K, Cal Poly
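
For the multivariate case raised above (a statistic such as R^2 that needs a whole feature vector per observation), one possible workaround is to bootstrap the row indices instead of the data. This is not something proposed in the thread; it is a sketch that swaps raveling for index resampling, so that whole rows stay together. The names `X`, `y`, `row_ids`, and `r_squared` are hypothetical, the data are synthetic, and `method="percentile"` is again chosen only for simplicity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # only used to make synthetic data
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n_samples)

def r_squared(idx):
    # `idx` is a resampled 1-D array of row indices; selecting rows with it
    # keeps each observation's features and response together.
    idx = np.asarray(idx).astype(int)
    Xb = np.column_stack([np.ones(idx.size), X[idx]])  # add an intercept
    yb = y[idx]
    coef, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    resid = yb - Xb @ coef
    return 1.0 - resid.var() / yb.var()

row_ids = np.arange(n_samples)

# vectorized=False: the statistic is called once per resample with 1-D input
# and returns a scalar, so the dimensionality rule discussed above holds.
res = stats.bootstrap((row_ids,), r_squared, n_resamples=499,
                      vectorized=False, method="percentile")
ci = res.confidence_interval
print(ci.low, ci.high)
```

Because the statistic closes over `X` and `y`, `bootstrap` only ever sees a single 1-D sample, and the elements of a row are never treated as interchangeable.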

participants (4)
- Asher Lewis
- Jesse Livezey
- Matt Haberland
- NoraFitas@protonmail.com