On 9/22/06, Tim Hochberg wrote:
On Thursday 21 September 2006 15:28, Tim Hochberg wrote:
David M. Cooke wrote:
On Thu, 21 Sep 2006 11:34:42 -0700, Tim Hochberg wrote:
Tim Hochberg wrote:
Robert Kern wrote:
> David M. Cooke wrote:
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>> Let me offer a third path: the algorithms used for .mean() and
>>> .var() are substandard. There are much better incremental algorithms
>>> that entirely avoid the need to accumulate such large (and
>>> precision-losing) intermediate values. The algorithms look like the
>>> following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>     m = a[0]
>>>     for i in range(1, len(a)):
>>>         m += (a[i] - m) / (i + 1)
>>>     return m
>>>
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>>
> With one exception, every test that I've thrown at it shows that it's
> better for float32. That exception is uniformly spaced arrays, like
> linspace().
>
>> You do avoid accumulating large sums, but then doing the division
>> a[i]/len(a) and adding that will do the same.
>
> Okay, this is true.
>
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>>     s = e = a.dtype.type(0)
>>     for i in range(0, len(a)):
>>         temp = s
>>         y = a[i] + e
>>         s = temp + y
>>         e = (temp - s) + y
>>     return s / len(a)
>>
>>> def var(a):
>>>     m = a[0]
>>>     t = a.dtype.type(0)
>>>     for i in range(1, len(a)):
>>>         q = a[i] - m
>>>         r = q / (i + 1)
>>>         m += r
>>>         t += i * q * r
>>>     t /= len(a)
>>>     return t
>>>
>>> Alternatively, from Knuth:
>>>
>>> def var_knuth(a):
>>>     m = a.dtype.type(0)
>>>     variance = a.dtype.type(0)
>>>     for i in range(len(a)):
>>>         delta = a[i] - m
>>>         m += delta / (i + 1)
>>>         variance += delta * (a[i] - m)
>>>     variance /= len(a)
>>>     return variance

I'm going to go ahead and attach a module containing the versions of
mean, var, etc. that I've been playing with, in case someone wants to
mess with them. Some were stolen from traffic on this list; for others
I grabbed the algorithms from Wikipedia or equivalent.
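For readers who want to try the quoted routines directly, here is a pure-Python adaptation (a sketch: plain lists of Python floats stand in for NumPy arrays, so the dtype-specific accumulators become plain 0.0, and the test data is illustrative):

```python
import statistics

def mean_incremental(values):
    # Robert Kern's incremental mean: update the running mean in place,
    # never accumulating a large (precision-losing) sum.
    m = values[0]
    for i in range(1, len(values)):
        m += (values[i] - m) / (i + 1)
    return m

def var_knuth(values):
    # Knuth's one-pass (Welford-style) population variance.
    m = 0.0
    variance = 0.0
    for i, x in enumerate(values):
        delta = x - m
        m += delta / (i + 1)
        variance += delta * (x - m)
    return variance / len(values)

data = [0.1 * i for i in range(100)]
print(mean_incremental(data))   # close to statistics.fmean(data)
print(var_knuth(data))          # close to statistics.pvariance(data)
```

On benign double-precision data like this, both agree closely with the exact results from Python's `statistics` module; the differences the thread discusses only become visible in single precision.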
I looked into this a bit more. I checked float32 (single precision) and float64 (double precision), using long doubles (float96) for the "exact" results. This is based on your code. Results are compared using abs(exact_stat - computed_stat) / max(abs(values)), with 10000 values in the range [-100, 900].
First, the mean. In float32, the Kahan summation in single precision is better by about 2 orders of magnitude than simple summation. However, accumulating the sum in double precision is better by about 9 orders of magnitude than simple summation (7 orders more than Kahan).
In float64, Kahan summation is the way to go, by 2 orders of magnitude.
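As an illustration of why compensation helps (a sketch in plain Python floats, i.e. double precision, with made-up data): when many tiny terms are added to a large running sum, simple summation rounds each of them away entirely, while the Kahan correction term carries them forward:

```python
def kahan_sum(values):
    # Compensated (Kahan) summation, as in the mean() quoted above:
    # e carries the low-order bits that each addition to s rounds away.
    s = e = 0.0
    for x in values:
        temp = s
        y = x + e
        s = temp + y
        e = (temp - s) + y
    return s

data = [1.0] + [1e-16] * 1000    # exact sum is 1 + 1e-13
naive = sum(data)                # each 1e-16 is rounded away: stays 1.0
compensated = kahan_sum(data)    # recovers the lost 1e-13
```

The compensated result is accurate to within a couple of ulps regardless of the number of terms, which is why it shows up consistently ahead in these comparisons.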
For the variance, in float32, Knuth's method is *no better* than the two-pass method. Tim's code does an implicit conversion of intermediate results to float64, which is why he saw a much better result.
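The effect of such an implicit conversion is easy to reproduce without NumPy by forcing every intermediate result through single precision (a sketch using `struct` round-tripping to emulate float32; the data and sizes are illustrative, not the benchmark from this thread):

```python
import statistics
import struct

def f32(x):
    # Round a Python float to the nearest IEEE 754 single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

def var_knuth_rounded(values, rnd):
    # Knuth's one-pass variance; rnd=f32 rounds every intermediate to
    # float32, while rnd=identity keeps float64 intermediates (the
    # accidental behaviour described above).
    m = rnd(0.0)
    variance = rnd(0.0)
    for i, x in enumerate(values):
        delta = rnd(x - m)
        m = rnd(m + rnd(delta / (i + 1)))
        variance = rnd(variance + rnd(delta * rnd(x - m)))
    return rnd(variance / len(values))

data = [f32(1000.0 + 0.1 * (i % 10)) for i in range(10000)]
exact = statistics.pvariance(data)
err_single = abs(var_knuth_rounded(data, f32) - exact)
err_mixed = abs(var_knuth_rounded(data, lambda x: x) - exact)
# err_mixed comes out far smaller: float64 intermediates mask how the
# algorithm behaves in genuine single precision.
```

This is exactly the trap described above: the algorithm looks much better than it really is when the intermediates silently widen to double.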
Doh! And I fixed that same problem in the mean implementation earlier, too. I was astounded by how well Knuth's method was doing, but apparently not astounded enough.
Does it seem weird to anyone else that in: numpy_scalar <op> python_scalar the precision ends up being controlled by the python scalar? I would expect the numpy_scalar to control the resulting precision just like numpy arrays do in similar circumstances. Perhaps the egg on my face is just clouding my vision though.
The two-pass method using Kahan summation (again, in single precision), is better by about 2 orders of magnitude. There is practically no difference when using a double-precision accumulator amongst the techniques: they're all about 9 orders of magnitude better than single-precision two-pass.
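The two-pass method referred to here computes the mean first and then averages the squared deviations; with Kahan compensation in both passes it looks roughly like this (a double-precision sketch with illustrative data, checked against Python's `statistics` module):

```python
import math
import statistics

def kahan_sum(values):
    # Compensated summation, as in the quoted mean().
    s = e = 0.0
    for x in values:
        temp = s
        y = x + e
        s = temp + y
        e = (temp - s) + y
    return s

def var_twopass(values):
    # Pass 1: the mean.  Pass 2: mean of squared deviations.
    n = len(values)
    m = kahan_sum(values) / n
    return kahan_sum([(x - m) ** 2 for x in values]) / n

data = [1e6 + (i % 7) for i in range(1000)]   # large offset, small spread
print(var_twopass(data))   # close to statistics.pvariance(data)
```

The large constant offset is the hard case for one-pass formulas; the two-pass version handles it well because the deviations are small by the time anything is squared.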
In float64, Kahan summation is again better than the rest, by about 2 orders of magnitude.
I've put my adaptation of Tim's code, and box-and-whisker plots of the results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/
Conclusions:
- If you're going to calculate everything in single precision, use Kahan
  summation. Using it in double precision also helps.
- If you can use a double-precision accumulator, it's much better than any
  of the techniques in single precision only.
- For speed+precision in the variance, either use Kahan summation in single
  precision with the two-pass method, or use double precision with simple
  summation with the two-pass method. Knuth buys you nothing, except slower
  code :-)
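The accumulator conclusion is easy to reproduce without NumPy by rounding a single-precision accumulator through float32 after every addition (a sketch; the data and sizes are illustrative, not the benchmark above):

```python
import math
import struct

def f32(x):
    # Round a Python float to the nearest IEEE 754 single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

values = [f32(1000.01)] * 100000
exact_mean = math.fsum(values) / len(values)   # exactly rounded reference

# Simple summation with a (simulated) float32 accumulator: once the sum
# grows large, each addend is rounded to the accumulator's coarse ulp.
s32 = 0.0
for v in values:
    s32 = f32(s32 + v)
mean32 = s32 / len(values)

# The same simple summation with a float64 accumulator.
s64 = 0.0
for v in values:
    s64 = s64 + v
mean64 = s64 / len(values)
# mean64 lands far closer to exact_mean than mean32 does.
```

Nothing clever is happening in the double-precision version; the wider accumulator alone absorbs almost all of the rounding error, which is why it beats every single-precision-only technique above.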
The two-pass methods are definitely more accurate. I won't be convinced on the speed front till I see comparable C implementations slug it out, which may well mean never in practice. However, I expect that somewhere around 10,000 items the cache will overflow and memory bandwidth will become the bottleneck; at that point the extra operations of Knuth won't matter as much as making two passes through the array, and Knuth will win on speed. Of course, the accuracy is pretty bad at single precision, so this possible, theoretical speed advantage at large sizes probably doesn't matter.
-tim
After 1.0 is out, we should look at doing one of the above.
+1
Sebastian Haase wrote:
Hi, I don't understand much of these "fancy algorithms", but does the above mean that after 1.0 is released, the float32 functions will catch up in terms of precision ("to some extent") with using dtype=float64?
It sounds like there is some consensus to do something to improve the precision after 1.0 comes out. I don't think the details are entirely nailed down, but it sounds like some combination of using Kahan summation and increasing the minimum precision of the accumulator is likely for sum, mean, var and std.
I would go with double as the default except when the base precision is greater. Higher precisions should be available as a check, i.e., if the results look funny, use Kahan or extended to see if roundoff is a problem. I think Kahan should be available because it adds precision to *any* base precision, even extended or quad, so there is always something to check against. But I don't think Kahan should be the default because of the speed hit.

Chuck