Inconsistent standard deviation and variance implementation in scipy vs. scipy.stats
Hi,

It seems that the default implementation of std and var differs between numpy/scipy and scipy.stats, in that numpy/scipy is using the "biased" formulation (i.e. dividing by N) whereas scipy.stats is using the "unbiased" formulation (dividing by N-1) by default. Is this intentional (it could be potentially confusing...)? I realise that the "biased" version can be accessed in sp.stats with a kwarg, but what is the reason for two different implementations of the function(s)?

    In [30]: a
    Out[30]: array([ 1., 2., 3., 2., 3., 1.])

    In [31]: np.std(a)
    Out[31]: 0.81649658092772603

    In [32]: sp.std(a)
    Out[32]: 0.81649658092772603

    In [33]: sp.stats.std(a)
    Out[33]: 0.89442719099991586

    In [34]: sp.stats.std(a, bias=True)
    Out[34]: 0.81649658092772603

Same for np.var vs scipy.stats.var.

Johann
Johann, You can also get the unbiased estimates with numpy by setting the optional parameter ddof=1.
    >>> a = np.array([1., 2., 3., 2., 3., 1.])
    >>> np.std(a)
    0.81649658092772603
    >>> np.std(a, ddof=1)
    0.89442719099991586
I think the default to biased estimates was kept for backward compatibility.
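To make the two conventions concrete, here is a minimal sketch that recomputes both values from the example array by hand and compares them with `np.std`. The variable names (`ss`, `biased`, `unbiased`) are only illustrative.

```python
import numpy as np

a = np.array([1., 2., 3., 2., 3., 1.])
n = a.size

# Sum of squared deviations from the mean.
ss = np.sum((a - a.mean()) ** 2)

biased = np.sqrt(ss / n)          # divide by N     (numpy default, ddof=0)
unbiased = np.sqrt(ss / (n - 1))  # divide by N - 1 (ddof=1)

print(biased, np.std(a))
print(unbiased, np.std(a, ddof=1))
```

In general `ddof` is subtracted from N in the divisor, so `ddof=0` and `ddof=1` correspond to the "biased" and "unbiased" variants discussed above.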
On 9/24/2008 12:05 PM Pierre GM apparently wrote:
I think the default to biased estimates was kept for backward compatibility.
It is still a problem that scipy.var and scipy.stats.var behave differently (and even have a different signature). What is the way forward? An opening suggestion: unify the signature, let ``bias`` be a deprecated way to set ``ddof``, and warn users of scipy.stats.var (or std) if they do not set ``ddof``. Alan Isaac
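A rough sketch of what such a transitional signature might look like follows. This is a hypothetical wrapper written for illustration, not SciPy's actual code; the `axis=0` default and the fallback to `ddof=1` are assumptions chosen to mimic the old scipy.stats behaviour described in this thread.

```python
import warnings
import numpy as np

def var(a, axis=0, ddof=None, bias=None):
    """Hypothetical transitional wrapper around np.var (illustration only)."""
    if bias is not None:
        # Old-style keyword: still honoured, but flagged as deprecated.
        warnings.warn("'bias' is deprecated; use 'ddof' instead",
                      DeprecationWarning, stacklevel=2)
        if ddof is None:
            ddof = 0 if bias else 1
    if ddof is None:
        # Neither keyword given: warn, then keep the old unbiased default.
        warnings.warn("'ddof' not set; assuming ddof=1",
                      UserWarning, stacklevel=2)
        ddof = 1
    return np.var(a, axis=axis, ddof=ddof)
```

Calling `var(a, ddof=1)` is silent, while `var(a, bias=True)` and `var(a)` each emit a warning that points the caller at ``ddof``.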
2008/9/24 Alan G Isaac <aisaac@american.edu>:
On 9/24/2008 12:05 PM Pierre GM apparently wrote:
I think the default to biased estimates was kept for backward compatibility.
It is still a problem that scipy.var and scipy.stats.var behave differently (and even have a different signature). What is the way forward?
An opening suggestion: unify the signature, let ``bias`` be a deprecated way to set ``ddof``, and warn users of scipy.stats.var (or std) if they do not set ``ddof``.
How about (possibly in addition to your suggestion) deprecating the re-exporting of numpy functions inside scipy? People often seem to ask about whether they should be using the scipy "version" or the numpy "version" of some function, when in fact it's just a re-exporting of the name. This still leaves the question of an inconsistency between scipy and numpy, for which I think your suggestion is a reasonable solution. Anne
2008/9/24 Anne Archibald <peridot.faceted@gmail.com>:
How about (possibly in addition to your suggestion) deprecating the re-exporting of numpy functions inside scipy? People often seem to ask about whether they should be using the scipy "version" or the numpy "version" of some function, when in fact it's just a re-exporting of the name.
I'm all in favour of that suggestion. Cheers Stéfan
2008/9/24 Anne Archibald <peridot.faceted@gmail.com>:
How about (possibly in addition to your suggestion) deprecating the re-exporting of numpy functions inside scipy? People often seem to ask about whether they should be using the scipy "version" or the numpy "version" of some function, when in fact it's just a re-exporting of the name.
On 9/25/2008 3:48 AM Stéfan van der Walt apparently wrote:
I'm all in favour of that suggestion.
I am not taking a strong position other than to say user convenience should matter, but the following really seems adequate to me:

    >>> help(sp.var)
    Help on function var in module numpy.core.fromnumeric:

Cheers, Alan Isaac
Wed, 24 Sep 2008 16:35:45 -0400, Anne Archibald wrote:
2008/9/24 Alan G Isaac <aisaac@american.edu>:
On 9/24/2008 12:05 PM Pierre GM apparently wrote:
I think the default to biased estimates was kept for backward compatibility.
It is still a problem that scipy.var and scipy.stats.var behave differently (and even have a different signature). What is the way forward?
An opening suggestion: unify the signature, let ``bias`` be a deprecated way to set ``ddof``, and warn users of scipy.stats.var (or std) if they do not set ``ddof``.
How about (possibly in addition to your suggestion) deprecating the re-exporting of numpy functions inside scipy? People often seem to ask about whether they should be using the scipy "version" or the numpy "version" of some function, when in fact it's just a re-exporting of the name.
The opposite direction would be completely removing `var` from scipy.stats. Is there a reason why the function is reimplemented in scipy? There's probably a need, e.g., for a float -> complex casting sqrt(), but I don't clearly see why there are two variants of `var`. Personally, I'd prefer not to have the same function reimplemented in two places, unless there is a clear need for it. I think there are more examples of duplication / signature mismatches in scipy vs. numpy that could be cleaned up a bit, at least in scipy.linalg. -- Pauli Virtanen
On Thursday, 25 September 2008, Pauli Virtanen wrote:
The opposite direction would be completely removing `var` from scipy.stats. Is there a reason why the function is reimplemented in scipy? There's probably a need, e.g., for a float -> complex casting sqrt(), but I don't clearly see why there are two variants of `var`.
Personally, I'd prefer not to have the same function reimplemented in two places, unless there is a clear need for it. I think there are more examples of duplication / signature mismatches in scipy vs. numpy that could be cleaned up a bit, at least in scipy.linalg.
I agree that duplicate implementations of the same function are confusing. However, within numpy itself there is further inconsistency, in that np.var and np.std use the "ddof" kwarg, whereas np.cov uses the "bias" kwarg (as do sp.stats.std and sp.stats.var). Also, the default normalisation in np.cov is by N-1 (unbiased), whereas in np.std and np.var the default is by N (biased). Johann
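The mismatch in defaults can be checked directly on the example array from the start of the thread. This is a small sketch; the variable names are illustrative.

```python
import numpy as np

a = np.array([1., 2., 3., 2., 3., 1.])

cov_default = float(np.cov(a))           # np.cov normalises by N - 1 by default
var_default = float(np.var(a))           # np.var normalises by N by default

cov_biased = float(np.cov(a, bias=True)) # bias=True switches np.cov to N
var_unbiased = float(np.var(a, ddof=1))  # ddof=1 switches np.var to N - 1

print(cov_default, var_unbiased)  # these two agree
print(cov_biased, var_default)    # and so do these two
```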
On Thu, Sep 25, 2008 at 5:35 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
How about (possibly in addition to your suggestion) deprecating the re-exporting of numpy functions inside scipy? People often seem to ask about whether they should be using the scipy "version" or the numpy "version" of some function, when in fact it's just a re-exporting of the name.
Yes, it would be nice. What do other people think about deprecating all the numpy re-exports in scipy? It would be nice to do for 0.7 (e.g. in 0.7, deprecated, in 0.8, removed). cheers, David
2008/9/26 David Cournapeau <cournape@gmail.com>:
Yes, it would be nice. What do other people think about deprecating all the numpy re-export in scipy ? It would be nice to do for 0.7 (e.g. in 0.7, deprecated, in 0.8, removed).
There were no objections to this, so may we go ahead? Stéfan
2008/10/6 Stéfan van der Walt <stefan@sun.ac.za>:
2008/9/26 David Cournapeau <cournape@gmail.com>:
Yes, it would be nice. What do other people think about deprecating all the numpy re-export in scipy ? It would be nice to do for 0.7 (e.g. in 0.7, deprecated, in 0.8, removed).
There were no objections to this, so may we go ahead?
My only concern is possible user confusion: some functions (e.g. sqrt) are provided as "enhanced" versions in scipy, while others are simply reexported. If we remove the reexports, users can't simply use scipy.whatever to get the best-available version of each function, they have to know whether an enhanced version exists. Of course, since the enhanced versions exist because their APIs differ in important and possibly surprising ways (e.g., sqrt(-1) has a different return type from sqrt(1)) this may be a good thing. Anne
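The return-type difference mentioned here can be seen with numpy alone, assuming the "enhanced" sqrt is the one numpy exposes as `np.emath.sqrt` (an alias for `numpy.lib.scimath.sqrt`). A minimal sketch:

```python
import numpy as np

# Plain numpy stays in the real domain: a negative input gives nan.
with np.errstate(invalid="ignore"):
    plain = np.sqrt(-1.0)

# The extended-domain version switches to complex only when needed.
ext_neg = np.emath.sqrt(-1.0)  # complex result
ext_pos = np.emath.sqrt(1.0)   # real result

print(plain, ext_neg, ext_pos)
print(np.iscomplexobj(ext_neg), np.iscomplexobj(ext_pos))
```

The output type therefore depends on the input value, which is the "possibly surprising" behaviour referred to above.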
On Mon, Oct 6, 2008 at 8:34 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:
2008/10/6 Stéfan van der Walt <stefan@sun.ac.za>:
2008/9/26 David Cournapeau <cournape@gmail.com>:
Yes, it would be nice. What do other people think about deprecating all the numpy re-export in scipy ? It would be nice to do for 0.7 (e.g. in 0.7, deprecated, in 0.8, removed).
There were no objections to this, so may we go ahead?
My only concern is possible user confusion: some functions (e.g. sqrt) are provided as "enhanced" versions in scipy, while others are simply reexported. If we remove the reexports, users can't simply use scipy.whatever to get the best-available version of each function, they have to know whether an enhanced version exists. Of course, since the enhanced versions exist because their APIs differ in important and possibly surprising ways (e.g., sqrt(-1) has a different return type from sqrt(1)) this may be a good thing.
Arguing that SciPy is below 1.0, I think the re-exporting should be minimized as much as possible. I don't think that some people's preference for "from scipy import *" (without a preceding "from numpy import *") should be a deciding point. It would be nice if it were clear enough (in general) whether a given function should be expected to be part of numpy or part of scipy. Leftover uncertainties should get documented in a short form, a list maybe. My two cents.... -Sebastian Haase
2008/10/6 Sebastian Haase <haase@msg.ucsf.edu>:
On Mon, Oct 6, 2008 at 8:34 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:
2008/10/6 Stéfan van der Walt <stefan@sun.ac.za>:
2008/9/26 David Cournapeau <cournape@gmail.com>:
Yes, it would be nice. What do other people think about deprecating all the numpy re-export in scipy ? It would be nice to do for 0.7 (e.g. in 0.7, deprecated, in 0.8, removed).
There were no objections to this, so may we go ahead?
My only concern is possible user confusion: some functions (e.g. sqrt) are provided as "enhanced" versions in scipy, while others are simply reexported. If we remove the reexports, users can't simply use scipy.whatever to get the best-available version of each function, they have to know whether an enhanced version exists. Of course, since the enhanced versions exist because their APIs differ in important and possibly surprising ways (e.g., sqrt(-1) has a different return type from sqrt(1)) this may be a good thing.
Arguing that SciPy is below 1.0, I think the re-exporting should be minimized as much as possible. I don't think that some people's preference for "from scipy import *" (without a preceding "from numpy import *") should be a deciding point.
The case I was concerned about was

    import scipy as sp
    x = sp.cos(2*sp.arccos(y))

If this is changed to

    import numpy as np
    x = np.cos(2*np.arccos(y))

it suddenly stops working for values y>1. To keep the same behaviour it needs to be

    import scipy as sp
    import numpy as np
    x = np.cos(2*sp.arccos(y))

This is perhaps all right, but it does mean that users need to pay attention.

Anne
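The difference can be reproduced with numpy alone, under the assumption that the extended-domain arccos is the one numpy provides as `np.emath.arccos` (from `numpy.lib.scimath`). This sketch uses y = 2:

```python
import numpy as np

y = 2.0

# Real-domain arccos is undefined for |y| > 1, so the result is nan.
with np.errstate(invalid="ignore"):
    x_plain = np.cos(2 * np.arccos(y))

# The extended-domain arccos returns a complex angle, and the
# identity cos(2*arccos(y)) = 2*y**2 - 1 still holds.
x_ext = np.cos(2 * np.emath.arccos(y))

print(x_plain)
print(x_ext, 2 * y**2 - 1)
```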
On Mon, Oct 6, 2008 at 21:30, Anne Archibald <peridot.faceted@gmail.com> wrote:
The case I was concerned about was
    import scipy as sp
    x = sp.cos(2*sp.arccos(y))
If this is changed to
    import numpy as np
    x = np.cos(2*np.arccos(y))
it suddenly stops working for values y>1. To keep the same behaviour it needs to be
    import scipy as sp
    import numpy as np
    x = np.cos(2*sp.arccos(y))
This is perhaps all right, but it does mean that users need to pay attention.
Well, if we remove some names from scipy/__init__.py, we should remove them all. The actual definitions of those extended-domain functions are actually in numpy.lib.scimath, not scipy. We can make a convenient module inside numpy that basically does this:

    from numpy import *
    from numpy.lib.scimath import *

Then the transition for people using "import scipy" becomes quite easy.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
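The effect of layering the two star-imports can be imitated without an actual module. In the sketch below, `ExtendedNumpy` and `xp` are hypothetical names invented for illustration, and `EXTENDED` lists the functions that `numpy.lib.scimath` (also reachable as `np.emath`) overrides.

```python
import numpy as np

# Functions that numpy.lib.scimath redefines with an extended domain.
EXTENDED = ("sqrt", "log", "log2", "logn", "log10",
            "power", "arccos", "arcsin", "arctanh")

class ExtendedNumpy:
    """Hypothetical namespace: numpy names, with scimath versions taking precedence."""

    def __getattr__(self, name):
        if name in EXTENDED:
            return getattr(np.emath, name)
        return getattr(np, name)

xp = ExtendedNumpy()

y = 2.0
x = xp.cos(2 * xp.arccos(y))  # cos comes from numpy, arccos from scimath
print(x)
```

With such a namespace, the snippet from the earlier message keeps working for y > 1 after only the import line is changed.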
Stéfan van der Walt wrote:
2008/9/26 David Cournapeau <cournape@gmail.com>:
Yes, it would be nice. What do other people think about deprecating all the numpy re-export in scipy ? It would be nice to do for 0.7 (e.g. in 0.7, deprecated, in 0.8, removed).
There were no objections to this, so may we go ahead?
+1 (My 0.02) Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
On Thu, Sep 25, 2008 at 4:29 AM, Alan G Isaac <aisaac@american.edu> wrote:
An opening suggestion: unify the signature, let ``bias`` be a deprecated way to set ``ddof``, and warn users of scipy.stats.var (or std) if they do not set ``ddof``.
The problem is that there is another discrepancy between numpy.var and scipy.stats.var: the default for the axis argument is 0 in scipy.stats, None in numpy. So if we want to be really compatible, we can't just add a new argument and deprecate the old one; we have to deprecate the current signature, and change it later. I suggested some time ago to deprecate scipy.stats current signature for 0.7, and set the new one in 0.8. If that's fine with you, we could do that. I don't feel comfortable changing a function in scipy.stats (because it is not "my" module), but OTOH, nobody reacted last time we had this discussion, so maybe we should just do it. David
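To see why the axis default matters, compare numpy's `axis=None` behaviour with an explicit `axis=0` on a small 2-D array. This sketch uses numpy only; the array `m` is an arbitrary example.

```python
import numpy as np

m = np.array([[1., 2.],
              [3., 6.]])

flat = np.var(m)          # axis=None: variance over all four elements
cols = np.var(m, axis=0)  # axis=0: one variance per column

print(flat)  # a single scalar
print(cols)  # an array with one entry per column
```

Code written against one default silently gets a result of a different shape under the other, which is why the change is harder to spot than a missing argument.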
2008/9/25 David Cournapeau <cournape@gmail.com>:
I suggested some time ago to deprecate scipy.stats current signature for 0.7, and set the new one in 0.8. If that's fine with you, we could do that. I don't feel comfortable changing a function in scipy.stats (because it is not "my" module), but OTOH, nobody reacted last time we had this discussion, so maybe we should just do it.
Yes, do it. If someone complains, that's why we have SVN. The SciPy API is not (and cannot be) frozen -- it is still too immature, so let's get it up to scratch ASAP. Cheers Stéfan
On Thu, Sep 25, 2008 at 12:50 AM, Stéfan van der Walt <stefan@sun.ac.za> wrote:
2008/9/25 David Cournapeau <cournape@gmail.com>:
I suggested some time ago to deprecate scipy.stats current signature for 0.7, and set the new one in 0.8. If that's fine with you, we could do that. I don't feel comfortable changing a function in scipy.stats (because it is not "my" module), but OTOH, nobody reacted last time we had this discussion, so maybe we should just do it.
Yes, do it. If someone complains, that's why we have SVN. The SciPy API is not (and cannot be) frozen -- it is still too immature, so let's get it up to scratch ASAP.
I also don't think it is absolutely necessary to have a deprecation release. The code is officially labeled beta and is currently not regularly released. -- Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
Jarrod Millman wrote:
I also don't think it is absolutely necessary to have a deprecation release. The code is officially labeled beta and is currently not regularly released.
Yes, but in that case, it would be pretty bad to be caught by it. Deprecating does not cost us anything (well, it costs me a couple of minutes), whereas skipping the deprecation costs users a lot. I *hated* it when numpy changed its axis argument before the 1.0 release; it took me hours to track it down everywhere in my code (a missing argument is different: you know that something is wrong right away). I would prefer not to inflict this on other people. Not in my name, at least :) cheers, David
Jarrod Millman wrote:
On Thu, Sep 25, 2008 at 12:50 AM, Stéfan van der Walt <stefan@sun.ac.za> wrote:
2008/9/25 David Cournapeau <cournape@gmail.com>:
I suggested some time ago to deprecate scipy.stats current signature for 0.7, and set the new one in 0.8. If that's fine with you, we could do that. I don't feel comfortable changing a function in scipy.stats (because it is not "my" module), but OTOH, nobody reacted last time we had this discussion, so maybe we should just do it.

Yes, do it. If someone complains, that's why we have SVN. The SciPy API is not (and cannot be) frozen -- it is still too immature, so let's get it up to scratch ASAP.
I added a deprecation warning, naming the correct numpy function to use (with alternative arguments), for the following:

- mean
- median
- std
- var
- cov
- corrcoef

AFAICS, all functionality of any of those is available in numpy (contrary to what the comment says; I guess the comments are vastly out of date).

cheers, David
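For reference, the numpy calls below are one plausible way to reproduce the old scipy.stats conventions described in this thread (reduction along axis 0, normalisation by N-1). The exact mapping for `cov` and `corrcoef`, in particular treating rows as observations via `rowvar=False`, is an assumption made for this sketch and is not taken from the deprecation messages themselves.

```python
import numpy as np

# Rows are observations, columns are variables (assumed layout).
x = np.array([[1., 2.],
              [2., 4.],
              [3., 7.],
              [4., 8.]])

mean = np.mean(x, axis=0)
median = np.median(x, axis=0)
std = np.std(x, axis=0, ddof=1)      # ddof=1 gives the N - 1 normalisation
var = np.var(x, axis=0, ddof=1)
cov = np.cov(x, rowvar=False)        # np.cov already defaults to N - 1
corr = np.corrcoef(x, rowvar=False)

print(mean, median, std, var)
print(cov)
print(corr)
```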