[Numpy-discussion] calculating the mean and variance of a large float vector

Fri Jun 6 09:56:50 EDT 2008

Alan McIntyre wrote:
> On Thu, Jun 5, 2008 at 10:16 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>   
>> How can that lead to instability? If the last half-million values are
>> small then they won't have a big impact on the mean even if they are
>> ignored. The variance is a mean too (of the squares), so it should be
>> stable too. Or am I, once again, missing the point?
>>     
>
> No, I just didn't think about it long enough, and I shouldn't have
> tried to make an example off the cuff. ;)   After thinking about it
> again, I think some loss of accuracy is probably the worst that can
> happen.
>
>   
Any problems are going to be mainly due to the distribution of the
numbers, especially if there are both very small and very large numbers.
This is mitigated by numerical precision and by the choice of algorithm.
My guess is that it will take a rather extreme case to cause you any
problems.
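
To make the "very small next to very large" point concrete, here is a
small sketch using only the standard library. The three values are made
up for illustration, and math.fsum is used as the accurate reference;
nothing here is specific to the original poster's data.

```python
import math

# One huge value, one small value, then the huge value cancelled out.
# (Illustrative numbers only.)
values = [1e16, 1.0, -1e16]

# Plain left-to-right accumulation in double precision: the 1.0 is
# absorbed when it is added to 1e16, so it never reappears.
naive_total = 0.0
for v in values:
    naive_total += v

# math.fsum keeps track of the lost low-order bits and returns the
# correctly rounded sum of the inputs.
exact_total = math.fsum(values)

print("naive loop:", naive_total)
print("math.fsum: ", exact_total)
```

It takes sixteen orders of magnitude between the values to lose the 1.0
completely in double precision, which is why I call this an extreme case.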

Python and NumPy already use high numerical precision: a Python float is
a 64-bit double on essentially every platform, and NumPy defines 32-bit
and 64-bit floats plus an extended-precision type whose size (128-bit on
some platforms) depends on the architecture, if you want to go higher
(or lower). This means that calculations in 64-bit are rather
insensitive to the numbers used, so typically there is no reason for any
concern (ignoring hardware faults such as the old Pentium FDIV bug,
http://en.wikipedia.org/wiki/Pentium_FDIV_bug ).
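
As a rough illustration of how much the precision of the accumulator
matters, the sketch below averages a million copies of 0.1 stored as
float32. The array contents are an assumption made for the example.
np.cumsum is used here only to force a simple running sum in float32;
np.mean with dtype=np.float64 does the accumulation in double precision.

```python
import numpy as np

n = 1_000_000
# A million float32 copies of 0.1 (made-up test data).
x = np.full(n, 0.1, dtype=np.float32)

# Running sum kept in float32 at every step, then divided by n.
mean32 = float(np.cumsum(x, dtype=np.float32)[-1]) / n

# Same data, but accumulated in float64.
mean64 = float(np.mean(x, dtype=np.float64))

print("float32 running sum:", mean32)
print("float64 accumulator:", mean64)
print("abs error float32:  ", abs(mean32 - 0.1))
print("abs error float64:  ", abs(mean64 - 0.1))
```

If the data have to stay in float32 to save memory, asking for a float64
accumulator through the dtype argument is usually the cheapest fix.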

The second issue is the algorithm, where you need to balance performance
with precision. For simple calculations, see:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
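
As a sketch of that trade-off, here are two single-pass ways to get the
sample variance in plain Python: the textbook sum-of-squares formula and
Welford's running update, both of which are described on the page above.
The function names and the test data (four values with a large common
offset, true sample variance 30) are my own choices for illustration,
not anything NumPy does internally.

```python
def naive_variance(data):
    """One pass, textbook formula: (sum(x^2) - sum(x)^2 / n) / (n - 1)."""
    n = 0
    sum_x = 0.0
    sum_sq = 0.0
    for x in data:
        n += 1
        sum_x += x
        sum_sq += x * x
    # Subtracting two huge, nearly equal numbers loses most of the digits.
    return (sum_sq - sum_x * sum_x / n) / (n - 1)


def welford_variance(data):
    """One pass, Welford's update of the running mean and squared deviations."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


# Small spread on top of a large offset (illustrative values).
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

naive_result = naive_variance(data)
welford_result = welford_variance(data)

print("naive:  ", naive_result)
print("welford:", welford_result)
```

The naive version is the cheapest per element but ends up far from 30
here, while Welford's update costs a division per element and stays
accurate. For a vector that fits in memory, a two-pass calculation (mean
first, then squared deviations) is another reasonable middle ground.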

Bruce