[Numpy-discussion] def of var of complex

Robert Kern robert.kern at gmail.com
Wed Jan 9 00:02:05 EST 2008


Charles R Harris wrote:
> 
> 
> On Jan 8, 2008 7:48 PM, Robert Kern <robert.kern at gmail.com> wrote:
> 
>     Charles R Harris wrote:
> 
>      > Suppose you have a set of z_i and want to choose z to minimize the
>      > average square error $\sum_i |z_i - z|^2$. The solution is
>      > $z = \bar{z} = \frac{1}{N}\sum_i z_i$, and the resulting average
>      > error is given by 2). Note that I didn't mention Gaussians anywhere.
>      > No distribution is needed to justify the argument, just the idea of
>      > minimizing the squared distance. Leaving out the ^2 would yield
>      > another metric, or one could ask for a minimax solution. It is a
>      > question of the distance function, not probability. Anyway, that is
>      > one justification for the approach in 2), and it is one that makes
>      > a lot of applied math simple. Whether or not a least squares fit is
>      > useful is a different question.
> 
>     If you're not doing probability, then what are you using var() for?
>     I can accept that the quantity is meaningful for your problem, but
>     I'm not convinced it's a variance.
> 
> 
> Lots of fits don't involve probability distributions. For instance, one 
> might want to fit a polynomial to a mathematical curve. This sort of 
> distinction between probability and distance goes back to Gauss himself, 
> although not in his original work on least squares.  Whether or not 
> variance implies probability is a semantic question.
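
For the record, the minimization claim itself isn't in dispute, and it's
easy to check numerically. A minimal sketch (the data and names are mine,
and I'm reading 2) as mean(|z - mean(z)|^2)):

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100) + 1j * rng.normal(size=100)

def avg_sq_error(c):
    # (1/N) * sum_i |z_i - c|^2
    return np.mean(np.abs(z - c) ** 2)

zbar = z.mean()

# Nudging the estimate away from the complex mean in any direction
# strictly increases the average squared error.
for dz in (0.01, -0.01, 0.01j, -0.01j):
    assert avg_sq_error(zbar) < avg_sq_error(zbar + dz)

# The minimized error is 2), which decomposes into the marginal
# variances of the real and imaginary parts.
assert np.allclose(avg_sq_error(zbar), z.real.var() + z.imag.var())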

Well, the problem in front of us is entirely semantics: What does the string 
"var(z)" mean? Are we going to choose a mechanistic definition: "var(z) is 
implemented in such-and-such a way, and interpretations are left open"? In that 
case, why are we using the string "var(z)" rather than something else? We're 
also still left with the question of which such-and-such implementation to use.

Alternatively, we can look at what people call "variances" and try to implement 
the calculation of such. In that case, the term "variance" tends to crop up (and 
in my experience *only* crop up) in statistics and probability (S&P). Certain 
implementations of the calculations of such quantities have cognates elsewhere, 
but those cognates are not themselves called variances.

My question to you is, is "the resulting average error" a variance? I.e., do 
people call it a variance outside of S&P? There are any number of computations 
that are useful but are not variances, and I don't think we should make "var(z)" 
implement them.

In S&P, the single quantity "variance" is well defined for real RVs, even if you 
step away from Gaussians. It's the second central moment of the PDF of the RV. 
When you move up to CC (or RR^2), the definition of "moment" changes. It's no 
longer a real number or even a scalar; the second central moment is a covariance 
matrix. If we're going to call something "the variance", that's it. The 
circularly symmetric forms are special cases. Although option #2 is a useful 
quantity to calculate in some circumstances, I think it's bogus to give it a 
special status.
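
To make the distinction concrete, here is a small sketch (the data and
names are mine) showing the matrix-valued second central moment next to
the scalar of option #2, which is exactly the trace of that matrix:

import numpy as np

rng = np.random.default_rng(1)
# Correlate the real and imaginary parts so the scatter is elliptical,
# i.e. deliberately not circularly symmetric.
x = rng.normal(size=10000)
y = 0.5 * x + 0.1 * rng.normal(size=10000)
z = x + 1j * y

# The second central moment of z viewed as an RR^2-valued RV: the 2x2
# covariance matrix of (Re z, Im z). bias=True normalizes by N to
# match the moment definition.
C = np.cov([z.real, z.imag], bias=True)

# Option #2 collapses C to a single number (its trace) and throws away
# the off-diagonal shape information.
scalar = np.mean(np.abs(z - z.mean()) ** 2)
assert np.allclose(scalar, np.trace(C))

For a circularly symmetric z, the matrix is a multiple of the identity
and the trace loses nothing; in general, it does.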

> I think if we are 
> going to compute a single number,  2) is as good as anything even if it 
> doesn't capture the shape of the scatter plot. A 2D covariance wouldn't 
> necessarily capture the shape either.

True, but it is clear exactly what it is. The function is named "cov()", and it 
computes covariances. It's not called "shape_of_2D_pdf()". Whether or not one 
ought to compute a covariance is not "cov()"'s problem.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco


