On Wed, Dec 17, 2008 at 10:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former. This problem arises because of a pretty narrow concept of "error" that gets misapplied to the variance estimator. The usual "error" that gets used is the arithmetic difference between the estimate and the true value of the parameter (p_est - p_true). For parameters like means, this is usually fine, but for so-called scale parameters like variance, it's quite inappropriate. For example, the arithmetic error between a true value of 1.0 (in whatever units) and an estimate of 2.0 is the same as that between 101.0 and 102.0. When you drop a square root into that formula, you don't get the same answers out when you seek the estimator that sets the bias to 0.
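The point above is easy to check numerically. Here is a minimal sketch with normal data and true sigma = 1; the sample size, trial count, and seed are arbitrary choices for the demonstration:

```python
# Sketch: the square root of the "unbiased" variance estimate is not an
# unbiased estimate of the standard deviation (normal data, true sigma = 1).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000
x = rng.standard_normal((trials, n))

var_est = x.var(axis=1, ddof=1)  # the (n-1)-denominator variance estimate
std_est = x.std(axis=1, ddof=1)  # its square root

print(var_est.mean())  # close to 1.0: unbiased for the variance
print(std_est.mean())  # noticeably below 1.0: biased low for the std
```

The averaged variance estimate comes out very close to the true value of 1.0, while the averaged standard deviation estimate sits visibly below 1.0 at this small sample size.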
Old habits led me astray: in response to your previous email, I checked the docs for variance, not for standard deviation. I have never looked at the statistical properties of estimators of the standard deviation, only at those of variance estimators. I learned the unbiased estimator as a contrast to the maximum likelihood estimator for the variance. So, thank you for the clarification; I guess it is the same story with scale parameter estimation when fitting distributions.
Rather, a much more appropriate error measure for variance would be the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the same distance from each other as 100.0 and 200.0. Using this measure of error to define bias, the unbiased estimate of the standard deviation actually is the square root of the unbiased estimate of the variance, too, thanks to the magic of logarithms. Unfortunately for those who want to call the (n-1) version "unbiased", the unbiased estimator (for normal distributions, at least) uses (n-2). Oops! Other distributions have different optimal denominators: heavier-tailed distributions tend towards (n-3), finite-support distributions tend towards (n-1).
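The (n-2) claim for the normal case can be checked by simulation. A sketch, again with normal data and true variance 1.0 (sample size, trial count, and seed are arbitrary choices):

```python
# Sketch: mean log-ratio error log(est/true) of the variance estimate for
# denominators n, n-1, and n-2, on normal samples with true variance 1.0.
import numpy as np

rng = np.random.default_rng(1)
n, trials, true_var = 10, 200_000, 1.0
x = rng.standard_normal((trials, n))

# sum of squared deviations from the sample mean, per trial
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for ddof in (0, 1, 2):  # denominators n, n-1, n-2
    est = ss / (n - ddof)
    bias = np.log(est / true_var).mean()
    print(f"denominator n-{ddof}: mean log-ratio error = {bias:+.4f}")
```

On this metric, the (n-2) denominator comes out closest to zero, while both (n) and (n-1) are biased low, consistent with the claim above for normal data.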
But of course, bias is not the only thing to be concerned about. The bias is just the arithmetic average of the errors. If you want to minimize the total spread of the errors sum(abs(err)), too, that's another story. With the arithmetic error metric, the unbiased estimator of the variance uses (n-1) while the estimator with the smallest total error uses (n). With the log-ratio error metric, the unbiased estimator is the same as the one that minimizes the total error. Happy days!
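The total-spread comparison under the arithmetic metric can be sketched the same way; here mean absolute error stands in for sum(abs(err)) (same thing up to a constant factor), and the sample size, trial count, and seed are again arbitrary:

```python
# Sketch: mean arithmetic |error| of the variance estimate for denominators
# n-2, n-1, and n, on normal samples with true variance 1.0.
import numpy as np

rng = np.random.default_rng(2)
n, trials, true_var = 10, 200_000, 1.0
x = rng.standard_normal((trials, n))

# sum of squared deviations from the sample mean, per trial
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for d in (n - 2, n - 1, n):
    err = np.abs(ss / d - true_var).mean()
    print(f"denominator {d}: mean |error| = {err:.4f}")
```

On this run the (n) denominator gives the smallest mean absolute error of the three, in line with the claim above.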
Agreed, but this then leads to the discussion of what the appropriate loss function for the estimator is, which would go too far for a standard statistical presentation. Fortunately, for large samples, n versus n-1 or n-2 doesn't matter, and asymptotically we are all the same, at least in nice cases.
I find the population/sample distinction to be bogus, too. IIRC, there are even some sources which switch the two meanings around. In any case, the docstring should have a section saying: "If you are looking for what is called the 'unbiased' or 'sample' estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification.
I compared the docstrings for np.var and np.std. np.std is very brief on bias, and I think the sentence in np.var is relatively neutral; neither docstring mentions the word "unbiased". Your reference to the "unbiased" and "sample" estimators might be useful for users who are not so familiar with the statistical background. After your explanation, I would think it better to drop any reference to bias in np.std. On the other hand, the scipy docstrings still need a lot of cleaning and writing.

Josef