Hi All,

I noticed that the sum() and mean() methods of numarray arrays use the precision of the given array in their calculations. That leads to results like this:
>>> array([255, 255], Int8).sum()
-2
>>> array([255, 255], Int8).mean()
-1.0
Would it not be better to use double precision internally and return the correct result?

Cheers, Peter

--
Dr. Peter J. Verveer
Cell Biology and Cell Biophysics Programme
EMBL
Meyerhofstrasse 1
D-69117 Heidelberg
Germany
Tel.: +49 6221 387245
Fax: +49 6221 387242
Email: Peter.Verveer@embl-heidelberg.de
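For readers puzzled by those numbers, here is a plain-Python sketch (it does not use numarray at all, and the to_int8 helper is purely illustrative) of the signed 8-bit wrap-around that produces -2 and -1.0:

    # A signed 8-bit integer holds -128..127, so 255 wraps around to -1
    # before the values are ever summed.
    def to_int8(value):
        """Reduce an integer to the signed 8-bit two's-complement range."""
        value &= 0xFF
        return value - 256 if value > 127 else value

    values = [255, 255]
    wrapped = [to_int8(v) for v in values]        # [-1, -1]
    print(sum(wrapped))                           # -2, like Int8 sum()
    print(float(sum(wrapped)) / len(wrapped))     # -1.0, like Int8 mean()

    # Up-casting by hand before summing avoids the overflow:
    print(sum(float(v) for v in values))          # 510.0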
On Mon, 2003-09-01 at 05:34, Peter Verveer wrote:
Hi All,
I noticed that the sum() and mean() methods of numarray arrays use the precision of the given array in their calculations. That leads to results like this:
>>> array([255, 255], Int8).sum()
-2
>>> array([255, 255], Int8).mean()
-1.0
Would it not be better to use double precision internally and return the correct result?
Cheers, Peter
Hi Peter,

I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.

(1) The first "solution" is to require users to do their own up-casting prior to calling mean() or sum(). This gives the end user fine control over storage cost but leaves the C-like pitfall/bug you discovered. I mention this because this is how the numarray/Numeric reductions are designed. Is there a reason why the numarray/Numeric reductions don't implicitly up-cast?

(2) The second way is what you proposed: use double precision within mean and sum. This has great simplicity but gives no control over storage usage, and as implemented, the storage would be much higher than one might think, potentially 8x.

(3) Lastly, Perry suggested a more radical approach: rather than changing the mean and sum methods themselves, we could alter the universal function accumulate and reduce methods to implicitly use additional precision. Perry's idea was to make all accumulations and reductions up-cast their results to the largest type of the current family, either Bool, Int64, Float64, or Complex64. By doing this, we can improve the utility of the reductions and accumulations as well as fixing the problem with sum and mean.

--
Todd Miller
jmiller@stsci.edu
STSCI / ESS / SSB
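As a rough sketch of what Perry's family-based rule in option (3) could look like: the table and helper below are hypothetical, not numarray code; they simply map each type to the widest member of its family as described above. Under such a rule the Int8 sum from the start of the thread would accumulate into an Int64 and come out as 510.

    # Hypothetical sketch of the option (3) promotion rule: reductions and
    # accumulations would return the widest type of the input's type family.
    # Type names are strings here for illustration only; in numarray they
    # would be the actual type objects (Bool, Int64, Float64, Complex64, ...).
    LARGEST_IN_FAMILY = {
        'Bool': 'Bool',
        'Int8': 'Int64', 'Int16': 'Int64', 'Int32': 'Int64', 'Int64': 'Int64',
        'Float32': 'Float64', 'Float64': 'Float64',
        'Complex32': 'Complex64', 'Complex64': 'Complex64',
    }

    def reduction_result_type(input_type):
        """Result type a reduce/accumulate would get under option (3)."""
        return LARGEST_IN_FAMILY[input_type]

    print(reduction_result_type('Int8'))     # 'Int64': Int8 sums no longer wrap
    print(reduction_result_type('Float32'))  # 'Float64'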
Todd Miller wrote:
I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
[snip]

Just a thought: why not make the upcasting an optional parameter? I've found that python's arguments with default values provide a very convenient way of giving the user fine control with minimal conceptual overhead.

I'd rather write:

    arr = array([255, 255], Int8)
    ... later
    arr.sum(use_double=1)  # or some similar way of tuning sum()

than

    arr = array([255, 255], Int8)
    ... later
    array(arr, typecode='d').sum()

Numarray/numpy are trying to tackle an inherently hard problem: matching the high-level comfort of python with the low-level performance of C. This situation is an excellent example of what I've seen described as the 'law of leaky abstractions': in most cases where you encapsulate low level details in a high level abstraction, there end up being situations where the details poke through the abstraction and cause you grief. This is an inherently tricky problem, with no easy, universal solution (that I'm aware of).

Cheers, f.
On Tue, Sep 02, 2003 at 01:20:17PM -0600, Fernando Perez wrote:
Todd Miller wrote:
I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
[snip]
Just a thought: why not make the upcasting an optional parameter?
I've found that python's arguments with default values provide a very convenient way of giving the user fine control with minimal conceptual overhead.
I'd rather write:
    arr = array([255, 255], Int8)
    ... later
    arr.sum(use_double=1)  # or some similar way of tuning sum()
+1, but arr.sum(typecode=Float64) would be my choice of spelling. Not sure what the default typecode should be, though. Probably Perry's suggestion: the largest type of the family.

--
Robert Kern
kern@caltech.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
  -- Richard Harter
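Until a keyword like that exists, roughly the same spelling can be had with small user-side wrappers. The sum_as/mean_as helpers below are hypothetical, not numarray API; they just package the explicit up-cast already shown in this thread (array(arr, typecode='d')), so they still create a temporary double-precision copy of the array:

    # Hypothetical user-side wrappers approximating the proposed keyword API.
    # They are NOT part of numarray; they wrap the explicit up-cast, so a
    # temporary double-precision copy of the array is still made.
    from numarray import array

    def sum_as(arr, typecode='d'):
        """Sum `arr` after casting it to `typecode` (default: double)."""
        return array(arr, typecode=typecode).sum()

    def mean_as(arr, typecode='d'):
        """Mean of `arr`, computed at `typecode` precision."""
        return array(arr, typecode=typecode).mean()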
On Tue, 2003-09-02 at 15:20, Fernando Perez wrote:
Todd Miller wrote:
I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
[snip]
Just a thought: why not make the upcasting an optional parameter?
<snip> That sounds like a great idea. Simple, but doesn't throw out all storage control.
in most cases where you encapsulate low level details in a high level abstraction, there end up being situations where the details poke through the abstraction and cause you grief.
Thanks for these kind words.

--
Todd Miller <jmiller@stsci.edu>
Hi Todd,
I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
It was just an innocent question, I did not think it would have such ramifications :-) Here are my thoughts:

If I understand you well, the sum() and mean() array methods are based on the reduce method of the universal functions, and these do their calculations in the precision of the array. Is that correct?

I also gave this some thought, and I would like to make a distinction between a reduction and the calculation of a statistical value such as the mean or the sum:

To me, a reduction means the projection of a multi-dimensional array to an array with a rank that is one less than the input. The result is still an array, and often I want the result to have the same precision as the input.

A statistical calculation like a sum or a mean is different: the result should be correct and the same irrespective of the type of the input, and that mandates using sufficient precision in the calculation. Note, however, that such a statistic is a scalar result and does not require temporary storage at high precision for the whole array.

So keeping this in mind, my comments on your solutions are:
(1) The first "solution" is to require users to do their own up-casting prior to calling mean() or sum(). This gives the end user fine control over storage cost but leaves the C-like pitfall/bug you discovered. I mention this because this is how the numarray/Numeric reductions are designed. Is there a reason why the numarray/Numeric reductions don't implicitly up-cast?
For reductions this behaviour suits me, precisely because it allows control over storage, which is one of the strengths of numarray. For calculating the mean or the sum of an array this is, however, an expensive solution for a very common operation. I do use this solution, but sometimes I prefer an optimized C routine instead.
(2) The second way is what you proposed: use double precision within mean and sum. This has great simplicity but gives no control over storage usage, and as implemented, the storage would be much higher than one might think, potentially 8x.
I did not want to suggest storing a cast version of the array before calculating the mean or the sum. That calculation can be done in double precision without converting the whole array in memory. I think we can all agree that making a full double-precision copy would not be a good idea.
(3) Lastly, Perry suggested a more radical approach: rather than changing the mean and sum methods themselves, we could alter the universal function accumulate and reduce methods to implicitly use additional precision. Perry's idea was to make all accumulations and reductions up-cast their results to the largest type of the current family, either Bool, Int64, Float64, or Complex64. By doing this, we can improve the utility of the reductions and accumulations as well as fixing the problem with sum and mean.
I think that is a great idea in principle, but I think you should consider this carefully:

First of all, control of the storage cost is lost when you do a reduction. I would not always find that desirable. Thus, I would then like the old behaviour for reductions to remain accessible, either as a different method or by setting an optional argument.

Additionally, it would not work well for some operations. For instance, precise calculation of the mean requires floating-point precision. Maybe this can be solved, but it would require different casting behaviour for different operations. That might be too much trouble...

I would like to propose a fourth option:

(4) Provide separate implementations for array methods like sum() and mean() that only calculate the scalar result. No additional storage would be necessary and the calculation can be done in double precision. I guess the disadvantage is that one cannot leverage the existing code in the ufuncs so easily. I also assume that it would not be as general a solution as changing the reduce method is.

I do have some experience writing this sort of code for multidimensional arrays in C and would be happy to contribute code. However, I am not too familiar with the internals of the numarray library and I don't know how well my code fits in there (although I interface all my code to numarray). But I am happy to help if I can; numarray is great stuff, it has become the main tool for my numerical work.

Peter

--
Dr. Peter J. Verveer
Cell Biology and Cell Biophysics Programme
EMBL
Meyerhofstrasse 1
D-69117 Heidelberg
Germany
Tel.: +49 6221 387245
Fax: +49 6221 387242
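To make option (4) a bit more concrete, here is a minimal pure-Python sketch of the idea: one pass over the low-precision elements, accumulating into a double, with no temporary up-cast array. A real version would be a C loop over the array's buffer; the plain sequence and the function names below are only illustrative, not numarray code:

    # Pure-Python sketch of option (4): accumulate in double precision
    # (Python floats are C doubles) while reading the low-precision data
    # element by element, so no up-cast temporary array is ever created.
    def sum_double(elements):
        total = 0.0                  # double-precision accumulator
        for x in elements:
            total += x               # each element is up-cast as it is read
        return total

    def mean_double(elements):
        total, count = 0.0, 0
        for x in elements:
            total += x
            count += 1
        return total / count

    print(sum_double([255, 255]))    # 510.0, independent of the storage type
    print(mean_double([255, 255]))   # 255.0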
participants (5):

- Fernando Perez
- Peter Verveer
- Robert Kern
- Todd Miller
- verveer@embl.de