[Numpy-discussion] please change mean to use dtype=float

Wed Sep 20 04:01:18 EDT 2006

Sebastian Haase wrote:
> Robert Kern wrote:
>> Sebastian Haase wrote:
>>> I know that having too much knowledge of the details often makes one 
>>> forget what the "newcomers" will do and expect.
>> Please be more careful with such accusations. Repeated frequently, they can 
>> become quite insulting.
>>
> I did not mean to insult anyone - what I meant was, that I'm for numpy 
> becoming an easy platform to use. I have spend and enjoyed part the last 
> four years developing and evangelizing Python as an alternative to 
> Matlab and C/Fortran based image analysis environment. I often find 
> myself arguing for good support of the single precision data format. So 
> I find it actually somewhat ironic to see myself arguing now for wanting 
> float64 over float32 ;-)

No one is doubting that you want numpy to be easy to use. Please don't doubt 
that the rest of us want otherwise. However, the fact that you *want* numpy to 
be easy to use does not mean that your suggestions *will* make numpy easy to use.

We haven't forgotten what newcomers will do; to the contrary, we are quite aware 
that new users need consistent behavior in order to learn how to use a system. 
Adding another special case in how dtypes implicitly convert to one another will 
impede new users being able to understand the whole system. See A. M. 
Archibald's question in the thread "ufunc.reduce and conversion" for an example. 
In our judgement this is a worse outcome than notational convenience for float32 
users, who already need to be aware of the effects of their precision choice. 
Each of us can come to different conclusions in good faith without one of us 
forgetting the new user experience.

Let me offer a third path: the algorithms used for .mean() and .var() are 
substandard. There are much better incremental algorithms that entirely avoid 
the need to accumulate such large (and therefore precision-losing) intermediate 
values. The algorithms look like the following for 1D arrays in Python:

def mean(a):
     m = a[0]
     for i in range(1, len(a)):
         m += (a[i] - m) / (i + 1)
     return m

def var(a):
     m = a[0]
     t = a.dtype.type(0)
     for i in range(1, len(a)):
         q = a[i] - m
         r = q / (i+1)
         m += r
         t += i * q * r
     t /= len(a)
     return t

Alternatively, from Knuth:

def var_knuth(a):
     m = a.dtype.type(0)
     variance = a.dtype.type(0)
     for i in range(len(a)):
         delta = a[i] - m
         m += delta / (i+1)
         variance += delta * (a[i] - m)
     variance /= len(a)
     return variance

If you will code up implementations of these for ndarray.mean() and 
ndarray.var(), I will check them in and then float32 arrays will avoid most of 
the catastrophes that the current implementations run into.

>>> We are only talking 
>>> about people that will a) work with single-precision data (e.g. large 
>>> scale-image analysis) and who b) will tend to "just use the default" 
>>> (dtype)  --- How else can I say this: these people will just assume that 
>>> arr.mean() *is* the mean of arr.
>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of 
>> arr. Double precision can't fix that.
>>
> This was not supposed to be a scientific statement -- I'm (again) 
> thinking of our students that not always appreciate the full complexity 
> of computational numerics and data types and such.

They need to appreciate the complexity of computational numerics if they are 
going to do numerical computation. Double precision does not make it any simpler.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco