Hi Todd,

I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.

It was just an innocent question, I did not think it would have such ramifications :-) Here are my thoughts: If I understand you well, the sum() and mean() array methods are based on the reduce method of the universal functions. And these do their calculations in the precision of the array, is that correct? I also gave this some thought, and I would like to make a distinction between a reduction and the calculation of a statistical value such as the mean or the sum: To me, a reduction means the projection of an multi-dimensional array to an array with a rank that is one less than the input. The result is still an array, and often I want the result to have the same precision as the input. A statistical calculation like a sum or a mean is different: the result should be correct and the same irrespective of the type of the input, and that mandates using sufficient precision in the calculation. Note however, that such a statistic is a scalar result and does not require temporary storage at high precision for the whole array. So keeping this in mind, my comments to your solutions are:

(1) The first "solution" is to require users to do their own up-casting prior to calling mean() or sum(). This gives the end user fine control over storage cost but leaves the C-like pitfall/bug you discovered. I mention this because this is how the numarray/Numeric reductions are designed. Is there a reason why the numarray/Numeric reductions don't implicitly up-cast?

For reductions this behaviour suits me, precisely because it allows control over storage, which is one of the strengths of numarray. For calculating the mean or the sum of an array this is however an expensive solution for a very common operation. I do use this solution, but sometimes I prefer an optimized C routine instead.

(2) The second way is what you proposed: use double precision within mean and sum. This has great simplicity but gives no control over storage usage, and as implemented, the storage would be much higher than one might think, potentially 8x.

I did not want to suggest to store a casted version of the array before calculation of the mean or the sum. That can be done in double precision without converting the whole array in memory. I think we can all agree that this option would not be a good idea.

(3) Lastly, Perry suggested a more radical approach: rather than changing the mean and sum methods themselves, we could alter the universal function accumulate and reduce methods to implicitly use additional precision. Perry's idea was to make all accumulations and reductions up-cast their results to the largest type of the current family, either Bool, Int64, Float64, or Complex64. By doing this, we can improve the utility of the reductions and accumulations as well as fixing the problem with sum and mean.

I think that is a great idea in principle, but I think you should consider this carefully: First of all control of the storage cost is lost when you do a reduction. I would not find that always desirable. Thus, I would then like the old behaviour for reduction to be accesible either as a different method or by a setting an optional argument. Additionally, it would not work well for some operations. For instance precise calculation of the mean requires floating point precision. Maybe this can be solved, but would it require different casting behaviour for different operations. That might be too much trouble... I would like to propose a fourth option: (4) provide separate implementations for array methods like sum() and mean() that only calculate the scalar result. No additional storage would be necessary and the calculation can be done in double precision. I guess that the disadvantage is that one cannot leverage the existing code in the ufuncs so easily. I also assume that it would not be such a general solution as changing the reduce method is. I do have some experience in writing these sorts of code for multidimensional arrays in C and would be happy to contribute code. However, I am not too familiar with the internals of numarray library and I don't know how well my code fits in there (although I interface all my code to numarray). But I am happy to help if I can, numarray is great stuff, it has become the main tool for my numerical work. Peter -- Dr. Peter J. Verveer Cell Biology and Cell Biophysics Programme EMBL Meyerhofstrasse 1 D-69117 Heidelberg Germany Tel. : +49 6221 387245 Fax : +49 6221 387242