[Numpy-discussion] calculating the mean and variance of a large float vector
Bruce Southey
bsouthey at gmail.com
Fri Jun 6 09:56:50 EDT 2008
Alan McIntyre wrote:
> On Thu, Jun 5, 2008 at 10:16 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>
>> How can that lead to instability? If the last half-million values are
>> small then they won't have a big impact on the mean even if they are
>> ignored. The variance is a mean too (of the squares), so it should be
>> stable too. Or am I, once again, missing the point?
>>
>
> No, I just didn't think about it long enough, and I shouldn't have
> tried to make an example off the cuff. ;) After thinking about it
> again, I think some loss of accuracy is probably the worst that can
> happen.
>
Any problems are going to be mainly due to the distribution of the
numbers, especially if there are both very small and very large numbers.
This is mitigated by numerical precision and the choice of algorithm - my
guess is that it will take a rather extreme case to cause you any problems.
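As a rough illustration of the kind of extreme case I have in mind (toy
numbers of my own, not anyone's real data): in 64-bit floats, adding 1.0
to 1e16 changes nothing, so a plain running sum can lose a small value
completely, while math.fsum from the standard library recovers it.

```python
import math

values = [1e16, 1.0, -1e16]  # toy data: one small value among huge ones

# Plain left-to-right accumulation in 64-bit floats.
running = 0.0
for v in values:
    running += v

# math.fsum tracks the rounding errors and returns a correctly rounded sum.
exact = math.fsum(values)

print(running)  # 0.0: the 1.0 was absorbed by 1e16
print(exact)    # 1.0
```

With values of similar magnitude this effect is far smaller, which is why
it usually takes an unusual distribution to see it.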
Python and NumPy already use high numerical precision by default (64-bit
floats, although this may depend on architecture), and NumPy defines
32-bit, 64-bit and, on platforms that support it, 128-bit floating-point
types if you want to go higher (or lower). This means that calculations
are rather insensitive to the numbers used, so typically there is no
reason for any concern (ignoring the old Pentium FDIV bug,
http://en.wikipedia.org/wiki/Pentium_FDIV_bug ).
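A minimal sketch of what choosing the precision looks like in NumPy. The
array here is made up for illustration; the point is only that np.finfo
reports the precision of each type and that mean() accepts a dtype
argument for the accumulator.

```python
import numpy as np

# Machine epsilon: the relative spacing of representable values for each type.
eps32 = np.finfo(np.float32).eps
eps64 = np.finfo(np.float64).eps
print(eps32, eps64)

x32 = np.linspace(0.0, 1.0, 1001, dtype=np.float32)  # made-up float32 data

mean32 = x32.mean()                  # accumulated and returned as float32
mean64 = x32.mean(dtype=np.float64)  # same data, accumulated in float64
print(mean32.dtype, mean64.dtype)
```

If the data is stored as float32 to save memory, asking for a float64
accumulator this way is probably the cheapest precaution.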
The second issue is the algorithm, where you need to balance performance
against precision. For simple calculations, see:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
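To make the trade-off concrete, here is a sketch in plain Python of two of
the one-pass approaches described on that page: the textbook
sum-of-squares formula and Welford's running update. The function names
and the test data are my own choices for illustration, and this is not
necessarily what NumPy does internally.

```python
def naive_variance(xs):
    """One pass, textbook formula: (sum(x^2) - sum(x)^2 / n) / (n - 1)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def welford_variance(xs):
    """One pass, updating a running mean and sum of squared deviations."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


# Small spread on top of a large offset; the true sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

naive = naive_variance(data)
welford = welford_variance(data)
print(naive, welford)
```

The naive formula subtracts two numbers near 4e18, where 64-bit floats
are spaced hundreds apart, so the small true variance is lost in the
rounding. Welford's update only ever works with deviations from the
running mean, at the cost of a division per element.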
Bruce