[Numpy-discussion] sum and mean methods behaviour

verveer at embl.de
Tue Sep 2 13:33:08 EDT 2003


Hi Todd, 
 
> I thought about this a lot yesterday and today talked it over with 
> Perry.   There are several ways to fix the problem with mean() and 
> sum(), and I'm hoping that you and the rest of the community will help 
> sort them out. 
 
It was just an innocent question; I did not think it would have such
ramifications :-)
 
Here are my thoughts: 
 
If I understand you correctly, the sum() and mean() array methods are based
on the reduce method of the universal functions, and these do their
calculations in the precision of the array. Is that correct?
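
For instance, something like this shows the pitfall (a sketch using
numarray-style names; the exact constructor keyword may be spelled
differently):

    from numarray import ones, Int8

    a = ones(1000, type=Int8)
    s = a.sum()    # accumulated in Int8: overflows, so the result is not 1000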
 
I also gave this some thought, and I would like to make a distinction between 
a reduction and the calculation of a statistical value such as the mean or the 
sum: 
 
To me, a reduction means the projection of a multi-dimensional array onto an
array whose rank is one less than that of the input. The result is still an
array, and often I want the result to have the same precision as the input.
 
A statistical calculation like a sum or a mean is different: the result
should be correct and the same irrespective of the type of the input, which
mandates using sufficient precision in the calculation. Note, however, that
such a statistic is a scalar result and does not require high-precision
temporary storage for the whole array.
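
To illustrate the distinction (again with numarray-style names, purely for
illustration):

    from numarray import ones, add, Int16

    img = ones((512, 512), type=Int16)
    row_sums = add.reduce(img)    # rank drops by one, type stays Int16

A statistic such as the mean of img, on the other hand, is a single number,
so accumulating it in a C double costs one eight-byte temporary, however
large the input array is.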
 
So, keeping this in mind, here are my comments on your solutions:
 
> (1) The first "solution" is to require users to do their own up-casting 
> prior to calling mean() or sum().  This gives the end user fine control 
> over storage cost but leaves the C-like pitfall/bug you discovered.   I 
> mention this because this is how the numarray/Numeric reductions are 
> designed.  Is there a reason why the numarray/Numeric reductions don't 
> implicitly up-cast? 
 
For reductions this behaviour suits me, precisely because it allows control 
over storage, which is one of the strengths of numarray. 
 
For calculating the mean or the sum of an array, however, this is an
expensive solution for a very common operation. I do use it, but sometimes I
prefer an optimized C routine instead.
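
In practice option (1) looks something like this (again a numarray-style
sketch):

    from numarray import ones, Int8, Float64

    a = ones(1000000, type=Int8)    # about 1 MB of data
    s = a.astype(Float64).sum()     # correct, but builds an ~8 MB temporary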
 
> (2) The second way is what you proposed:  use double precision within 
> mean and sum.  This has great simplicity but gives no control over 
> storage usage, and as implemented, the storage would be much higher than 
> one might think, potentially 8x. 
 
I did not mean to suggest storing a cast copy of the array before calculating
the mean or the sum; the calculation can be done in double precision without
converting the whole array in memory (see the sketch under option (4) below).
I think we can all agree that storing such a copy would not be a good idea.
 
> (3) Lastly, Perry suggested a more radical approach:  rather than 
> changing the mean and sum methods themselves,  we could alter the 
> universal function accumulate and reduce methods to implicitly use 
> additional precision.  Perry's idea was to make all accumulations and 
> reductions up-cast their results to the largest type of the current 
> family, either Bool, Int64, Float64, or Complex64.   By doing this, we 
> can improve the utility of the reductions and accumulations as well as 
> fixing the problem with sum and mean. 
 
I think that is a great idea in principle, but you should consider it
carefully:
 
First of all, control of the storage cost is lost when you do a reduction,
which is not always desirable. I would therefore like the old behaviour for
reductions to remain accessible, either as a different method or through an
optional argument.
 
Additionally, it would not work well for some operations. For instance,
precise calculation of the mean requires floating-point precision. Maybe this
can be solved, but it would require different casting behaviour for different
operations. That might be too much trouble...
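
As I understand Perry's proposal, the rule would amount to something like the
following mapping (purely illustrative Python, not actual numarray code):

    # Each type would reduce/accumulate in the largest member of its family.
    FAMILY_MAX = {
        'Bool':      'Bool',
        'Int8':      'Int64',     'Int16':     'Int64',
        'Int32':     'Int64',     'Int64':     'Int64',
        'Float32':   'Float64',   'Float64':   'Float64',
        'Complex32': 'Complex64', 'Complex64': 'Complex64',
    }

For sum() this would be fine: an Int8 array would be summed exactly in Int64.
But the mean of an integer array is generally not an integer, so mean() would
still have to cross over to Float64, which is a different casting rule than
the one used for the reduction itself. To keep the storage control mentioned
above, the old behaviour could stay reachable through a hypothetical optional
argument, for example add.reduce(a, type=Int8).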
 
I would like to propose a fourth option: 
 
(4) Provide separate implementations for array methods like sum() and mean()
that calculate only the scalar result. No additional storage would be
necessary, and the calculation can be done in double precision.
 
I guess the disadvantage is that one cannot leverage the existing ufunc code
so easily. I also assume it would not be as general a solution as changing
the reduce method would be.
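
To make option (4) concrete, here is a rough Python sketch of what such a
scalar-only mean could do. The real implementation would be a C routine, and
the numarray spellings (Float64, astype) may differ in detail:

    from numarray import Float64

    def scalar_mean(a, blocksize=4096):
        # a is assumed to be one-dimensional; the multi-dimensional
        # bookkeeping is left out of this sketch.
        total = 0.0                    # a single C-double accumulator
        n = len(a)
        for i in range(0, n, blocksize):
            block = a[i:i + blocksize].astype(Float64)  # small temporary only
            total += block.sum()
        return total / n

No Float64 copy of the whole array is ever built, yet the accumulation is
done entirely in double precision.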
 
I do have some experience writing this sort of code for multidimensional
arrays in C and would be happy to contribute. However, I am not too familiar
with the internals of the numarray library, and I don't know how well my code
would fit in there (although I interface all my code to numarray). But I am
happy to help if I can; numarray is great stuff and has become the main tool
for my numerical work.
 
Peter 
 
-- 
Dr. Peter J. Verveer 
Cell Biology and Cell Biophysics Programme 
EMBL 
Meyerhofstrasse 1 
D-69117 Heidelberg 
Germany 
Tel. : +49 6221 387245 
Fax  : +49 6221 387242 
 
 



