[SciPy-Dev] Improvement Suggestion for scipy.stats.spearmanr and by extension scipy.stats.mstats.spearmanr

Fri Apr 23 14:05:31 EDT 2021

*Intro:*
scipy.stats.spearmanr calculates the Spearman correlation between two 1D
arrays or, when given 2D array(s), performs the same operation
pairwise on their constituent 1D arrays.
Currently, scipy.stats.spearmanr uses scipy.stats.mstats.spearmanr under
the hood, which is where the issue arises.
To define what I mean by matching non-NaN values between two arrays,
consider these two arrays:
[1,NaN]
[2,3]
Position 0 is matching (neither array has a NaN there) and position 1
is not matching.
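
To make the definition concrete, here is a minimal sketch in plain Python.
The helper name matched_pairs is only for illustration and is not part of
SciPy.

```python
import math


def matched_pairs(a, b):
    """Return the (x, y) pairs at positions where neither value is NaN."""
    return [
        (x, y)
        for x, y in zip(a, b)
        if not (math.isnan(x) or math.isnan(y))
    ]


nan = float("nan")
pairs = matched_pairs([1, nan], [2, 3])
print(pairs)  # [(1, 2)]: only position 0 is matched
```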

*The Issue <https://github.com/scipy/scipy/issues/13900>:*
When called with *nan_policy='omit'*, scipy.stats.spearmanr raises
*ValueError: The input must have at least 3 entries!* when
comparing two arrays that have exactly 1 matched pair of non-NaN values,
provided that at least one of the arrays contains a NaN.
This becomes a problem when using spearmanr on large, sparse datasets, where
only aggressive NaN filtering or manual error catching can prevent
this error.
According to the nan policy doc
<http://scipy.github.io/devdocs/dev/api-dev/nan_policy.html> with respect
to *nan_policy='omit'*:

*More generally, for functions that return a scalar, func(a,
nan_policy='omit') should behave the same as func(a[~np.isnan(a)]).*

*Suggested Improvement:*
I therefore suggest that scipy.stats.mstats.spearmanr and
scipy.stats.spearmanr be changed to return
*SpearmanrResult(correlation=nan, pvalue=nan)*
when given two arrays that have exactly 1 matched pair of non-NaN
values. I have been corresponding with mdhaber, who mentioned that this would
be a difficult first issue for me to contribute to, since it would *break
backwards compatibility*. He also pointed me to another issue
<https://github.com/scipy/scipy/issues/12241> that likely stems from the
same root cause and results in inaccurate p-values and correlation values for
correlations involving arrays containing NaNs and arrays of all 0s.
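
As a rough sketch of the behavior I am proposing (not a patch against
SciPy), the logic could look like the following. The names
spearmanr_omit_sketch and corr are hypothetical; corr stands in for the
existing correlation routine and is assumed to take two sequences and
return a (correlation, pvalue) pair.

```python
import math


def spearmanr_omit_sketch(a, b, corr):
    """Drop positions where either value is NaN, then defer to `corr`.

    `corr` is a hypothetical stand-in for the existing routine: it takes
    two sequences and returns a (correlation, pvalue) pair.
    """
    kept = [
        (x, y)
        for x, y in zip(a, b)
        if not (math.isnan(x) or math.isnan(y))
    ]
    if len(kept) == 1:
        # Proposed change: a single matched pair yields NaNs, not ValueError.
        return (math.nan, math.nan)
    # Every other case is passed through to the existing routine unchanged.
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return corr(xs, ys)
```

With SciPy installed, corr could be a thin wrapper around
scipy.stats.spearmanr; passing it in as an argument just keeps the sketch
independent of any particular SciPy version.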

*Asking for feedback:*
Are there reasons to prefer the error that is currently being raised?
Should scipy.stats.spearmanr instead raise an error in both cases, with
and without NaNs, whenever only a single non-NaN pair is matched between
two arrays? That is, should the following also raise the error?
*spearmanr([1], [2], nan_policy='omit')*

Best,
Tobias Schraink