Robert Kern wrote:
Sebastian Haase wrote:
Robert Kern wrote:
Sebastian Haase wrote:
I know that having too much knowledge of the details often makes one forget what the "newcomers" will do and expect.
Please be more careful with such accusations. Repeated frequently, they can become quite insulting.
I did not mean to insult anyone - what I meant was, that I'm for numpy becoming an easy platform to use. I have spend and enjoyed part the last four years developing and evangelizing Python as an alternative to Matlab and C/Fortran based image analysis environment. I often find myself arguing for good support of the single precision data format. So I find it actually somewhat ironic to see myself arguing now for wanting float64 over float32 ;-)
No one is doubting that you want numpy to be easy to use. Please don't doubt that the rest of us want otherwise. However, the fact that you *want* numpy to be easy to use does not mean that your suggestions *will* make numpy easy to use.
We haven't forgotten what newcomers will do; to the contrary, we are quite aware that new users need consistent behavior in order to learn how to use a system. Adding another special case in how dtypes implicitly convert to one another will impede new users being able to understand the whole system. See A. M. Archibald's question in the thread "ufunc.reduce and conversion" for an example. In our judgement this is a worse outcome than notational convenience for float32 users, who already need to be aware of the effects of their precision choice. Each of us can come to different conclusions in good faith without one of us forgetting the new user experience.
Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python:
def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m
def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t
Alternatively, from Knuth:
def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance
If you will code up implementations of these for ndarray.mean() and ndarray.var(), I will check them in and then float32 arrays will avoid most of the catastrophes that the current implementations run into.
+1
We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to "just use the default" (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr.
I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that.
This was not supposed to be a scientific statement -- I'm (again) thinking of our students that not always appreciate the full complexity of computational numerics and data types and such.
They need to appreciate the complexity of computational numerics if they are going to do numerical computation. Double precision does not make it any simpler.