[SciPy-user] information on statistical functions

josef.pktd at gmail.com
Wed Dec 17 23:26:02 EST 2008


On Wed, Dec 17, 2008 at 10:11 PM, Robert Kern <robert.kern at gmail.com> wrote:

>
> The problem is that the "unbiased" estimate for the standard deviation
> is *not* the square root of the "unbiased" estimate for the variance.
> The latter is what numpy.std(x, ddof=1) calculates, not the former.
> This problem arises because of a pretty narrow concept of "error" that
> gets misapplied to the variance estimator. The usual "error" that gets
> used is the arithmetic difference between the estimate and the true
> value of the parameter (p_est - p_true). For parameters like means,
> this is usually fine, but for so-called scale parameters like
> variance, it's quite inappropriate. For example, the arithmetic error
> between a true value of 1.0 (in whatever units) and an estimate of 2.0
> is the same as that between 101.0 and 102.0. When you drop a square
> root into that formula, you don't get the same answers out when you
> seek the estimator that sets the bias to 0.


Old habits led me astray: in response to your previous email, I
checked the docs for the variance, not for the standard deviation. I
have never looked at the statistical properties of estimators of the
standard deviation, only at those of variance estimators; I learned
the unbiased estimator as a contrast to the maximum likelihood
estimator for the variance.
So, thank you for the clarification. I guess it is the same story with
the scale parameter estimation when fitting distributions.
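To make this concrete for myself, here is a small Monte Carlo sketch
(sample size, seed and number of replications are arbitrary choices):
the mean of the ddof=1 variance estimates sits at sigma**2, but the
mean of their square roots falls short of sigma.

import numpy as np

np.random.seed(0)
sigma = 1.0          # true population standard deviation
n = 5                # small sample size, so the bias is clearly visible
reps = 200000        # number of Monte Carlo replications

samples = np.random.normal(0.0, sigma, size=(reps, n))

# the average of the ddof=1 variance estimates is close to sigma**2 ...
var_hat = samples.var(axis=1, ddof=1)
# ... but the average of their square roots is below sigma
std_hat = samples.std(axis=1, ddof=1)

print("mean variance estimate: %.4f" % var_hat.mean())  # close to 1.0
print("mean std estimate:      %.4f" % std_hat.mean())  # roughly 0.94 for n=5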

>
> Rather, a much more appropriate error measure for variance would be
> the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the
> same distance from each other as 100.0 and 200.0. Using this measure
> of error to define bias, the unbiased estimate of the standard
> deviation actually is the square root of the unbiased estimate of the
> variance, too, thanks to the magic of logarithms. Unfortunately for
> those who want to call the (n-1) version "unbiased", the unbiased
> estimator (for normal distributions, at least) uses (n-2). Oops! Other
> distributions have different optimal denominators: heavier-tailed
> distributions tend towards (n-3), finite-support distributions tend
> towards (n-1).
>
> But of course, bias is not the only thing to be concerned about. The
> bias is just the arithmetic average of the errors. If you want to
> minimize the total spread of the errors sum(abs(err)), too, that's
> another story. With the arithmetic error metric, the unbiased
> estimator of the variance uses (n-1) while the estimator with the
> smallest total error uses (n). With the log-ratio error metric, the
> unbiased estimator is the same as the one that minimizes the total
> error. Happy days!

Agreed, but this then leads to a discussion of what the appropriate
loss function for the estimator is, which would go too far for a
standard statistical presentation.

Fortunately, for large samples, n versus n-1 or n-2 doesn't matter,
and asymptotically we are all the same, at least in nice cases.
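For anyone who wants to check the log-ratio claim numerically, here is
a quick simulation sketch (normal data, arbitrary n, seed and number of
replications) that prints the mean log-ratio error for a few
denominators; for normal samples it comes out closest to zero near (n-2).

import numpy as np

np.random.seed(1)
sigma = 1.0
n = 10
reps = 200000

samples = np.random.normal(0.0, sigma, size=(reps, n))
# centered sum of squares for each replication
ss = ((samples - samples.mean(axis=1)[:, np.newaxis]) ** 2).sum(axis=1)

# mean log-ratio error E[log(s_est / sigma)] for several denominators
for k in (0, 1, 2, 3):
    s_est = np.sqrt(ss / (n - k))
    bias = np.log(s_est / sigma).mean()
    print("denominator n-%d: mean log-ratio error = %+.4f" % (k, bias))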


>
> I also find the population/sample distinction to be bogus, too. IIRC,
> there are even some sources which switch around the meanings, too. In
> any case, the docstring should have a section saying, "If you are
> looking for what is called the "unbiased" or "sample" estimate of
> variance, use ddof=1." Those terms are widely, if incorrectly, used,
> so we should mention them. I just find it disheartening that the terms
> are used without qualification.
>

I compared the docstrings for np.var and np.std. np.std is very brief
about bias, and I think the sentence in np.var is relatively neutral;
neither docstring mentions the word "unbiased". Your reference to the
"unbiased" and "sample" estimators might be useful for users who are
not so familiar with the statistical background. After your
explanation, I think it would be better to drop any reference to bias
from the np.std docstring. On the other hand, the scipy docstrings
still need a lot of cleaning and writing.
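For reference, the behaviour under discussion (ddof defaults to 0;
ddof=1 gives the n-1 denominator) looks like this on a toy array:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.var(x))          # default ddof=0: divide by n
print(np.var(x, ddof=1))  # divide by n-1, the "unbiased"/"sample" variance
print(np.std(x))          # square root of the ddof=0 variance
print(np.std(x, ddof=1))  # square root of the ddof=1 variance, which is not
                          # an unbiased estimate of the standard deviation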

Josef


