![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
On Wed, Dec 17, 2008 at 20:32, <josef.pktd@gmail.com> wrote:
On Wed, Dec 17, 2008 at 9:03 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Dec 17, 2008 at 19:53, <josef.pktd@gmail.com> wrote:
On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen <timmichelsen@gmx-topmail.de> wrote:
Hello, I observed that there are 2 standard deviation functions in the scipy/numpy modules:
Numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
Scipy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#sci...
What is the difference? There is no formula included within the docstrings.
I suppose that np.std() is for the whole population and scipy.std is designed for a smaller sample in the population. Is that true?
difference between population (numpy) and sample (scipy.stats) variance and standard deviation is whether the the estimator is biased, i.e. 1/n, or not, i.e. 1/(n-1).
It's a shame that the "biased/unbiased" terminology still survives in the numpy.std() docstring. It's really quite wrong.
I find talking about biased versus unbiased estimator much clearer than the population - sample distinction, and degrees of freedom might be more descriptive but its meaning, I guess, relies on knowing about the (asymptotic) distribution of the estimator, which I always forget and have to look up.
The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former. This problem arises because of a pretty narrow concept of "error" that gets misapplied to the variance estimator. The usual "error" that gets used is the arithmetic difference between the estimate and the true value of the parameter (p_est - p_true). For parameters like means, this is usually fine, but for so-called scale parameters like variance, it's quite inappropriate. For example, the arithmetic error between a true value of 1.0 (in whatever units) and an estimate of 2.0 is the same as that between 101.0 and 102.0. When you drop a square root into that formula, you don't get the same answers out when you seek the estimator that sets the bias to 0. Rather, a much more appropriate error measure for variance would be the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the same distance from each other as 100.0 and 200.0. Using this measure of error to define bias, the unbiased estimate of the standard deviation actually is the square root of the unbiased estimate of the variance, too, thanks to the magic of logarithms. Unfortunately for those who want to call the (n-1) version "unbiased", the unbiased estimator (for normal distributions, at least) uses (n-2). Oops! Other distributions have different optimal denominators: heavier-tailed distributions tend towards (n-3), finite-support distributions tend towards (n-1). But of course, bias is not the only thing to be concerned about. The bias is just the arithmetic average of the errors. If you want to minimize the total spread of the errors sum(abs(err)), too, that's another story. With the arithmetic error metric, the unbiased estimator of the variance uses (n-1) while the estimator with the smallest total error uses (n). With the log-ratio error metric, the unbiased estimator is the same as the one that minimizes the total error. Happy days! I also find the population/sample distinction to be bogus, too. IIRC, there are even some sources which switch around the meanings, too. In any case, the docstring should have a section saying, "If you are looking for what is called the "unbiased" or "sample" estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco