On Wed, Dec 17, 2008 at 10:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former. This problem arises because of a pretty narrow concept of "error" that gets misapplied to the variance estimator. The usual "error" that gets used is the arithmetic difference between the estimate and the true value of the parameter (p_est - p_true). For parameters like means, this is usually fine, but for so-called scale parameters like variance, it's quite inappropriate. For example, the arithmetic error between a true value of 1.0 (in whatever units) and an estimate of 2.0 is the same as that between 101.0 and 102.0. When you drop a square root into that formula, you don't get the same answers out when you seek the estimator that sets the bias to 0.
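The point above is easy to check numerically. Here is a minimal sketch with normal data and true sigma = 1; the sample size, trial count, and seed are arbitrary choices for the demonstration:

```python
# Sketch: the square root of the "unbiased" variance estimate is not an
# unbiased estimate of the standard deviation (normal data, true sigma = 1).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000
x = rng.standard_normal((trials, n))

var_est = x.var(axis=1, ddof=1)  # the (n-1)-denominator variance estimate
std_est = x.std(axis=1, ddof=1)  # its square root

print(var_est.mean())  # close to 1.0: unbiased for the variance
print(std_est.mean())  # noticeably below 1.0: biased low for the std
```

The averaged variance estimate comes out very close to the true value of 1.0, while the averaged standard deviation estimate sits visibly below 1.0 at this small sample size.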
Old habits led me astray: in response to your previous email, I checked the docs for variance, not for standard deviation. I have never looked at the statistical properties of estimators of the standard deviation, only at those of variance estimators. I learned the unbiased estimator as a contrast to the maximum likelihood estimator for the variance. So, thank you for the clarification; I guess it is the same story with scale parameter estimation when fitting distributions.
Rather, a much more appropriate error measure for variance would be the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the same distance from each other as 100.0 and 200.0. Using this measure of error to define bias, the unbiased estimate of the standard deviation actually is the square root of the unbiased estimate of the variance, too, thanks to the magic of logarithms. Unfortunately for those who want to call the (n-1) version "unbiased", the unbiased estimator (for normal distributions, at least) uses (n-2). Oops! Other distributions have different optimal denominators: heavier-tailed distributions tend towards (n-3), finite-support distributions tend towards (n-1).
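The (n-2) claim for the normal case can be checked by simulation. A sketch, again with normal data and true variance 1.0 (sample size, trial count, and seed are arbitrary choices):

```python
# Sketch: mean log-ratio error log(est/true) of the variance estimate for
# denominators n, n-1, and n-2, on normal samples with true variance 1.0.
import numpy as np

rng = np.random.default_rng(1)
n, trials, true_var = 10, 200_000, 1.0
x = rng.standard_normal((trials, n))

# sum of squared deviations from the sample mean, per trial
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for ddof in (0, 1, 2):  # denominators n, n-1, n-2
    est = ss / (n - ddof)
    bias = np.log(est / true_var).mean()
    print(f"denominator n-{ddof}: mean log-ratio error = {bias:+.4f}")
```

On this metric, the (n-2) denominator comes out closest to zero, while both (n) and (n-1) are biased low, consistent with the claim above for normal data.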
But of course, bias is not the only thing to be concerned about. The bias is just the arithmetic average of the errors. If you want to minimize the total spread of the errors sum(abs(err)), too, that's another story. With the arithmetic error metric, the unbiased estimator of the variance uses (n-1) while the estimator with the smallest total error uses (n). With the log-ratio error metric, the unbiased estimator is the same as the one that minimizes the total error. Happy days!
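The total-spread comparison under the arithmetic metric can be sketched the same way; here mean absolute error stands in for sum(abs(err)) (same thing up to a constant factor), and the sample size, trial count, and seed are again arbitrary:

```python
# Sketch: mean arithmetic |error| of the variance estimate for denominators
# n-2, n-1, and n, on normal samples with true variance 1.0.
import numpy as np

rng = np.random.default_rng(2)
n, trials, true_var = 10, 200_000, 1.0
x = rng.standard_normal((trials, n))

# sum of squared deviations from the sample mean, per trial
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for d in (n - 2, n - 1, n):
    err = np.abs(ss / d - true_var).mean()
    print(f"denominator {d}: mean |error| = {err:.4f}")
```

On this run the (n) denominator gives the smallest mean absolute error of the three, in line with the claim above.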
Agreed, but this then leads to the discussion of what the appropriate loss function for the estimator is, which would go too far for a standard statistical presentation. Fortunately, for large samples, n versus n-1 or n-2 doesn't matter, and asymptotically we are all the same, at least in nice cases.
I find the population/sample distinction to be bogus, too. IIRC, there are even some sources which switch the two meanings around. In any case, the docstring should have a section saying: "If you are looking for what is called the 'unbiased' or 'sample' estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification.
I compared the docstrings for np.var and np.std. np.std is very brief on bias, and I think the sentence in np.var is relatively neutral; neither docstring mentions the word "unbiased". Your reference to the "unbiased" and "sample" estimators might be useful for users who are not so familiar with the statistical background. After your explanation, I would think it better to drop any reference to bias in np.std. On the other hand, the scipy docstrings still need a lot of cleaning and writing.

Josef