[Numpy-discussion] numpy.mean still broken for large float32 arrays

Thomas Unterthiner thomas_unterthiner at web.de
Thu Jul 24 06:55:07 EDT 2014


I don't agree. The problem is that I expect `mean` to do something 
reasonable. The documentation mentions that the results can be 
"inaccurate", which is a huge understatement: the results can be utterly 
wrong. That is not reasonable. At the very least, a warning should be 
issued in cases where the dtype might not be appropriate.
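
To make the failure mode concrete (the exact numbers depend on the NumPy 
version and on how the summation is carried out internally):

import numpy as np

X = np.ones(10**8, dtype=np.float32)
# With a float32 accumulator the running sum gets stuck once it reaches
# 2**24 = 16777216.0 (adding 1.0 no longer changes it), so the "mean"
# can come out nowhere near 1.0.
print(X.mean())                  # e.g. 0.16777216 instead of 1.0
# Accumulating in float64 gives the expected result.
print(X.mean(dtype=np.float64))  # 1.0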

One cannot predict what input sizes a program will be run with once it's 
deployed (especially if it stays in use for several years). I'd argue 
this is true for pretty much all code except quick one-off scripts. Thus 
one would have to use `dtype=np.float64` everywhere, at which point it 
seems obvious that it should have been the default in the first place. 
The other alternative would be to extend np.mean with some logic that 
internally figures out the right thing to do (which I don't think is too 
hard).
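
In practice, "use `dtype=np.float64` everywhere" means carrying a trivial 
wrapper like the one below through every project (mean64 is just an 
illustrative name, not something NumPy provides):

import numpy as np

def mean64(a, axis=None):
    # Accumulate in float64, then cast back to the input dtype so that
    # downstream code sees no difference.
    a = np.asanyarray(a)
    return a.mean(axis=axis, dtype=np.float64).astype(a.dtype)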

Your example with the short axis is something that can be checked for. I 
agree that the logic could become a bit hairy, but not too much: if we 
are going to sum up more than N values (where N could be determined at 
compile time, or simply be some constant), we upcast unless the user has 
explicitly specified a dtype. Of course, this would incur an increase in 
memory. However, I'd argue that it's not even a large increase: if you 
can fit the matrix in memory, then allocating a row/column of float64 
instead of float32 should be doable as well. And I'd much rather get an 
out-of-memory exception than silently continue my calculations with 
useless/wrong results.
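
To make that concrete, the check could look roughly like the sketch below 
(safe_mean and SAFE_N are made-up names for illustration, not anything 
NumPy provides, and the threshold is arbitrary):

import numpy as np

SAFE_N = 2**16  # "N": a compile-time constant or simply some fixed value

def safe_mean(a, axis=None, dtype=None):
    a = np.asanyarray(a)
    if dtype is None and a.dtype == np.float32:
        # Number of values summed into each output element.
        n = a.size if axis is None else a.shape[axis]
        if n > SAFE_N:
            # Upcast the accumulator; cast the result back to float32 so
            # the only extra memory is the float64 intermediate.
            return a.mean(axis=axis, dtype=np.float64).astype(np.float32)
    return a.mean(axis=axis, dtype=dtype)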

Cheers

Thomas



On 2014-07-24 11:59, Eelco Hoogendoorn wrote:
> Arguably, this isn't a problem of numpy, but of programmers being 
> trained to think of floating point numbers as 'real' numbers, rather 
> than just a finite number of states with a funny distribution over the 
> number line. np.mean isn't broken; your understanding of floating 
> point numbers is.
>
> What you appear to wish for is a silent upcasting of the accumulated 
> result. This is often performed in reducing operations, but I can 
> imagine it runs into trouble for nd-arrays. After all, if I have a 
> huge array that I want to reduce over a very short axis, upcasting 
> might be very undesirable; it wouldn't buy me any extra precision, but 
> it would increase memory use from 'huge' to 'even more huge'.
>
> np.mean has a kwarg that allows you to explicitly choose the dtype of 
> the accumulant. X.mean(dtype=np.float64)==1.0. Personally, I have a 
> distaste for implicit behavior, unless the rule is simple and there 
> really can be no negative downsides, which doesn't apply here, I would 
> argue. Perhaps when reducing an array completely to a single value, 
> there is no harm in upcasting to the maximum machine precision; but 
> that becomes a rather complex rule which would work out differently 
> for different machines. It's better to be confronted with the 
> limitations of floating point numbers earlier, rather than later when 
> you want to distribute your work and run into subtle bugs on other 
> people's computers.


