Support for complex data in stats.pearsonr
Hi,

A recent issue [1] brought up the idea of adding support for complex data in stats.pearsonr. It seems like this would be a useful feature, especially for folks working in signal processing, and it wouldn't be too difficult to implement; see [2]. The current p-value calculation, however, seems to underestimate the significance for complex data. That is, the observed error rate tends to be lower than the reported p-value, so the test is conservative. It would be possible to introduce a permutation test to estimate the p-value of the correlation between complex variables.

Is this something worth pursuing? What do you think of introducing an alternative p-value estimation method?

Cheers,
Chris

[1] https://github.com/scipy/scipy/issues/17518
[2] https://en.wikipedia.org/wiki/Complex_random_variable
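For illustration, here is a rough sketch of what this could look like. None of it is existing SciPy functionality: complex_pearsonr below just implements the sample correlation for complex data with a conjugated cross term (following [2]), and the p-value comes from the existing stats.permutation_test by permuting the pairing between the two samples.

    import numpy as np
    from scipy import stats

    def complex_pearsonr(x, y):
        # Sample Pearson correlation for complex data: conjugate the second
        # factor so that the magnitude of the result lies in [0, 1].
        x = np.asarray(x) - np.mean(x)
        y = np.asarray(y) - np.mean(y)
        num = np.sum(x * np.conj(y))
        den = np.sqrt(np.sum(np.abs(x)**2) * np.sum(np.abs(y)**2))
        return num / den

    rng = np.random.default_rng()
    n = 100
    x = rng.standard_normal(n) + 1j*rng.standard_normal(n)
    y = 0.5*x + rng.standard_normal(n) + 1j*rng.standard_normal(n)

    # Permuting the order of `x` (with `y` fixed) breaks any pairing between
    # the samples, which generates the null distribution of |r|.
    def statistic(x):
        return np.abs(complex_pearsonr(x, y))

    res = stats.permutation_test((x,), statistic, permutation_type='pairings',
                                 n_resamples=9999, alternative='greater')
    print(statistic(x), res.pvalue)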
After the addition of `stats.permutation_test`, `stats.monte_carlo_test`, and `stats.bootstrap`, the natural next step was to make them easier to use with statistics that SciPy already defines. IMO, the main question has been API. I added a permutation version of `anderson_ksamp` in [3] to resolve [4], but I wanted to settle on a longer-term plan before adding that sort of feature to many other functions.

Some questions to consider:

- If we want a function like `pearsonr` to perform a permutation test, how should the user specify the required information?
  - Parameters `n_resamples` and `random_state`, as we did in [3]?
  - A single object with both of these pieces of information (and possibly more configuration information) to avoid adding several new parameters to the function?
  - Alternatively, should the user just call `permutation_test` with `stats.pearsonr` as the callable, and we automatically choose appropriate settings for the test (e.g. the permutation type)? (A minimal example of this workflow follows this message.)
- In many cases, it can also make sense to perform a Monte Carlo test using the same statistic.
  - How does the user specify whether to perform the Monte Carlo version of the test or the permutation version (e.g. a `method` parameter)?
  - There is often a natural choice of distribution(s) from which to draw Monte Carlo samples (e.g. for a two-sample t-test, draw samples from the standard normal). But if users know something about the distribution(s) from which their data is drawn, it might make sense to customize the test using that information while still using the t-statistic. Should there be options for that, or at that point do we just direct the user to `monte_carlo_test` directly?
- We can ask the same sorts of questions about the `confidence_interval` method of `pearsonr` and other functions. For some tests, one can envision getting confidence intervals not only via the bootstrap, but also using permutation or Monte Carlo ideas!

I'd be happy to discuss this further. I've been interested in this for a while; I've just put it on the back burner because there are a lot of questions to answer ahead of time (unless we are OK with every function re-inventing the interface). Please feel free to open an issue!

Matt

[3] https://github.com/scipy/scipy/pull/16996
[4] https://github.com/scipy/scipy/issues/9527
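For reference, the "just call permutation_test" option mentioned in the list above is already possible with the current API; a minimal example of what that looks like today:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng()
    x = rng.standard_normal(50)
    y = x + rng.standard_normal(50)

    def statistic(x):
        # Only `x` is permuted; `y` is captured by the closure, so
        # permutation_type='pairings' shuffles which y each x is paired with.
        return stats.pearsonr(x, y).statistic

    res = stats.permutation_test((x,), statistic, permutation_type='pairings',
                                 n_resamples=9999)
    print(res.statistic, res.pvalue)

The catch is that the user currently has to know to choose permutation_type='pairings' and to hold one sample fixed in a closure; having `pearsonr` make those choices automatically is part of what the API discussion is about.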
What if we made these test methods (permutation_test, monte_carlo_test, bootstrap) classes with a __call__ method which could be called recursively from within the statistic? From a user's point of view, it would look something like this:

    mytest = permutation_test(my fav. arguments)
    rho, pvalue = pearsonr(data, significance=mytest, *)

This lets us encapsulate parameters particular to the tests in the test class rather than having to deal with them within the statistic's method. We can check test parameters from within the statistic methods to prompt warning messages when test parameters may not make sense.

We may also want to introduce a "None" option to the significance argument so that some computation time is saved in the test. In the case of the current default p-value/confidence estimation, some string argument to the significance parameter (e.g. significance="default" or "parametric") would turn on that behavior.

This would, I think, add a lot of functionality with a minimal change to the API. Setting up a particular kind of test would be similar to how one uses the statistical distribution classes. Perhaps with that thought, rather than a __call__ method, we could implement a test method, a CI method, etc. The call pattern would mimic the statistical distribution API.

- Chris
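To make the proposed call pattern concrete, here is a purely hypothetical sketch: PermutationTestConfig, pearsonr_with_significance, and the significance parameter are invented for illustration and are not SciPy API; the configuration object simply delegates to the existing stats.permutation_test.

    import numpy as np
    from scipy import stats
    from dataclasses import dataclass

    @dataclass
    class PermutationTestConfig:
        # Hypothetical configuration object bundling resampling options.
        n_resamples: int = 9999
        random_state: object = None

        def __call__(self, data, statistic, **kwargs):
            # Delegate to the existing function, supplying the stored options.
            return stats.permutation_test(data, statistic,
                                          n_resamples=self.n_resamples,
                                          random_state=self.random_state,
                                          **kwargs)

    def pearsonr_with_significance(x, y, significance=None):
        # Hypothetical pearsonr-like wrapper accepting a `significance` object.
        rho = stats.pearsonr(x, y).statistic
        if significance is None:   # significance=None skips the p-value entirely
            return rho, None
        res = significance((x,), lambda x: stats.pearsonr(x, y).statistic,
                           permutation_type='pairings')
        return rho, res.pvalue

    rng = np.random.default_rng()
    x = rng.standard_normal(30)
    y = x + rng.standard_normal(30)

    mytest = PermutationTestConfig(n_resamples=999)
    rho, pvalue = pearsonr_with_significance(x, y, significance=mytest)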
> What if we made these test methods (permutation_test, monte_carlo_test, bootstrap) classes with a __call__ method

This may be possible, but these functions already exist. If we were to go this route or something like it, it seems easier to modify the result object returned by these functions.

> which could be called recursively from within the statistic.

Why recursion? I have thought about generating bootstrap confidence intervals around the limits of bootstrap confidence intervals around the... (continue ad infinitum), but that is another story ; )
> From a user's point of view, it would look something like this:
>
>     mytest = permutation_test(my fav. arguments)
>     rho, pvalue = pearsonr(data, significance=mytest, *)
>
> This lets us encapsulate parameters particular to the tests in the test class rather than having to deal with them within the statistic's method.

"A single object with ... configuration information to avoid adding several new parameters to the function"

Yes. IIUC, you propose allowing the user to call `permutation_test`, `monte_carlo_test`, and `bootstrap` without specifying the (currently required) argument `statistic`. These functions would return a result object - which would no longer include any results, just configuration information - and instead of passing several different parameters into `pearsonr`, the user would pass this object into the statistic function (e.g. `pearsonr`). In this case, it sounds like we would just be using these functions to create an object that specifies configuration information. The thing I like about that is that we don't have to introduce new objects to the API or copy-paste any documentation. Still, because of the next issue, I'm not sure that we should co-opt the existing functions instead of creating some new type of configuration object.

> We can check test parameters from within the statistic methods to prompt warning messages when test parameters may not make sense.
Yes. I think it would be preferable to avoid this possibility entirely, though. For instance, out of the nine parameters `permutation_test` accepts, only three of them are valid for specifying the configuration like this: `n_resamples`, `batch`, and `random_state`. If the user passes `data`, `statistic`, `permutation_type`, `vectorized`, `alternative`, or `axis`, that information is either redundant or in conflict with what they would pass into `pearsonr`. I would prefer to avoid that possibility.
> We may also want to introduce a "None" option to the significance argument so that some computation time is saved in the test. In the case of the current default p-value/confidence estimation, some string argument to the significance parameter (e.g. significance="default" or "parametric") would turn on that behavior.
> This would, I think, add a lot of functionality with a minimal change to the API.

I 100% agree that there should be a way to use the hypothesis test functions to calculate only a statistic without calculating p-values. Note that this is already possible by making `pvalue` a `property` of the result object. (It would be preferable if it were a method like `confidence_interval`, IMO, but that ship has sailed without doing something like this: https://stackoverflow.com/questions/70253569/how-to-combine-property-and-fun... .) And yes, something like this would add a lot of functionality without much change to the API. I'm just not sure whether we should use the existing functions to generate the configuration object. It might be worth adding one new class (e.g. `ResamplingConfiguration` or some better name) that does exactly what we need.
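For what it's worth, a minimal sketch of the lazy-pvalue idea follows. The class below is hypothetical and not how SciPy's result objects are actually implemented; it only shows how a property can defer the expensive resampling until the attribute is first accessed.

    import numpy as np
    from scipy import stats

    class LazyPearsonRResult:
        # Hypothetical result object: the statistic is computed up front, but
        # the permutation p-value is computed only if `pvalue` is accessed.
        def __init__(self, x, y, n_resamples=9999):
            self._x, self._y = np.asarray(x), np.asarray(y)
            self._n_resamples = n_resamples
            self.statistic = stats.pearsonr(x, y).statistic
            self._pvalue = None

        @property
        def pvalue(self):
            if self._pvalue is None:   # pay for the resampling only on demand
                res = stats.permutation_test(
                    (self._x,),
                    lambda x: stats.pearsonr(x, self._y).statistic,
                    permutation_type='pairings',
                    n_resamples=self._n_resamples)
                self._pvalue = res.pvalue
            return self._pvalue

    rng = np.random.default_rng()
    x = rng.standard_normal(25)
    y = x + rng.standard_normal(25)
    res = LazyPearsonRResult(x, y)
    print(res.statistic)   # cheap
    print(res.pvalue)      # triggers the permutation test the first time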
> Setting up a particular kind of test would be similar to how one uses the statistical distribution classes. Perhaps with that thought, rather than a __call__ method, we could implement a test method, a CI method, etc. The call pattern would mimic the statistical distribution API.
Yes. This is in the stats roadmap as "Also, provide an infrastructure for implementing hypothesis tests," and it is a deliverable of our CZI grant. It's on my schedule after the distribution infrastructure work.

Besides having separate methods for calculating the point estimate, p-value, and confidence interval, it would allow the developer to define a hypothesis test with three pieces of information:

1. The statistic
2. The null hypothesis
3. The method for calculating the null distribution (sometimes this is combined with #2)

In some cases, this is also enough information for computing confidence intervals. (But a little more information is needed when the distribution of the sample statistic changes shape as the true value of the statistic changes.)

This is enough information to reproduce the behavior of almost all of our hypothesis tests. (There are a few exceptions like `tukey_hsd` and `boschloo`, but I think it would be great if we could unify the other ~90%.) All of the `axis`, `nan_policy`, and (in many cases) `alternative` behavior can rely on default implementations (or be overridden if desired).

We have not discussed whether we would make this object-oriented interface to hypothesis tests public, but there would still be value in using it to simplify/correct the implementations of some existing tests and new tests. I've opened an issue (https://github.com/scipy/scipy/issues/18067) to discuss this further.

Thanks for your thoughts!
Matt
--
Matt Haberland
Assistant Professor
BioResource and Agricultural Engineering
08A-3K, Cal Poly
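As a rough illustration of the three-ingredient infrastructure described above (a statistic, a null hypothesis, and a method for generating the null distribution), here is a hypothetical, permutation-only sketch; the class and names are invented for illustration and are not SciPy API. In this simplified view, the permutation type stands in for both the null hypothesis and the resampling scheme.

    import numpy as np
    from scipy import stats

    class ResamplingHypothesisTest:
        # Hypothetical base class: a test is defined by (1) its statistic and
        # (2)/(3) a null hypothesis expressed as a resampling scheme.
        def __init__(self, statistic, permutation_type, alternative='two-sided'):
            self.statistic = statistic
            self.permutation_type = permutation_type
            self.alternative = alternative

        def test(self, *data, n_resamples=9999):
            # Shared default p-value machinery; concrete tests could override
            # this, or add confidence_interval(), nan_policy handling, etc.
            return stats.permutation_test(data, self.statistic,
                                          permutation_type=self.permutation_type,
                                          alternative=self.alternative,
                                          n_resamples=n_resamples)

    # A pearsonr-like test then only needs to supply its statistic and null:
    pearson_test = ResamplingHypothesisTest(
        statistic=lambda x, y: stats.pearsonr(x, y).statistic,
        permutation_type='pairings')

    rng = np.random.default_rng()
    x = rng.standard_normal(40)
    y = x + rng.standard_normal(40)
    res = pearson_test.test(x, y)
    print(res.statistic, res.pvalue)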