Overloading numpy's ufuncs for better type coercion?
Hi!

(This mail is a reply to a personal conversation with Ullrich Köthe, but is obviously of a greater concern. This is about VIGRA's new NumPy-based python bindings.) Ulli considers this behaviour of NumPy to be a bug:

In [1]: a = numpy.array([200], numpy.uint8)
In [2]: a + a
Out[2]: array([144], dtype=uint8)

However, this is well-known, often-discussed, and IMHO not really unexpected for computer programmers who have ever worked with C-like languages (even Java has many such problems). Christian even said this is what he wants.

OTOH, I agree that it is a missing feature that NumPy performs "coercion before the operation" (to be more precise: the temporary data type should be promoted from the operand types, and /then/ the coercion - which can also reduce the number of bits - should happen), to fix this strange behaviour:

In [3]: numpy.add(a, a, numpy.empty((1, ), dtype = numpy.uint32))
Out[3]: array([144], dtype=uint32)

Now, our opinions differ on how to deal with this - Ulli planned to overwrite (more or less) all ufuncs in vigranumpy in order to return float32 (which is sort of a common denominator and the type nearly all other vigranumpy functions should accept). I see two main disadvantages here:

a) Choosing float32 seems to be arbitrary, and I'd like as much as possible of vigranumpy to be independent from VIGRA and its particular needs. I have seen so many requests (e.g. on the c++-sig mailing list) for *good* C++/boost-python <-> numpy bindings that it would be a pity IMO to add special cases for VIGRA by overloading __add__ etc.

b) Also, I find it unexpected and undesirable to change the behaviour of such basic operations as addition on our ndarray-derived image types. IMO this brings the danger of new users being confused about the different behaviours, and even experienced vigranumpy users might eventually fall into the trap when dealing with plain ndarrays and our derived types side-by-side.
Ideally, I'd like numpy to be "fixed" - I hope that the "only" obstacle is that someone needs to do it, but I am afraid of someone mentioning the term "backward-compatibility" (Ulli would surely call it "bug-compatibility" here ;-) ).

But in the end, I wonder how bad this really is for VIGRA. AFAICS, the main problem is that one needs to decide upon the pixel types for which to export algorithms (in most cases, we'll use just float32, at least for a start), and that when one loads images into arrays of the data type used in the image file, one will often end up with uint8 arrays which cannot be passed into many algorithms without an explicit conversion.

However, is this really a bad problem? For example, the conversion would typically have to be performed only once (after loading), no? Then, why not simplify things further by adding a dtype= parameter to importImage()? This could even default to float32 then.

Looking forward to your opinions,
Hans
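The importImage() idea can be sketched with a thin wrapper. The `importImage` and `_load_raw` names below are hypothetical stand-ins (the real vigranumpy loader and file reader are not shown); only the convert-once-after-loading logic is the point:

```python
import numpy as np

def _load_raw(path):
    # Hypothetical stand-in for the real file reader; a real loader
    # would return the pixel data in its stored dtype (often uint8).
    return np.array([[200, 50], [10, 255]], dtype=np.uint8)

def importImage(path, dtype=np.float32):
    """Load an image and convert it exactly once, right after loading.

    `dtype` defaults to float32 so that downstream algorithms which
    expect floating-point pixels need no further explicit conversion.
    """
    raw = _load_raw(path)
    return raw.astype(dtype)

img = importImage("example.png")     # float32 by default
assert img.dtype == np.float32
assert (img + img)[0, 0] == 400.0    # no uint8 wrap-around
```

With this default, the uint8 wrap-around problem disappears for the common case without touching any ufunc behaviour.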
Hi Hans!
Ideally, I'd like numpy to be "fixed"

What do you mean by "fixed"? Are you referring to Out[2] or Out[3]?
In [1]: a = numpy.array([200], numpy.uint8)
In [2]: a + a
Out[2]: array([144], dtype=uint8)

Please do not "fix" this, that IS the correct output. What should numpy do? Promote every sum of arrays of uint8 to uint16? Or perform the operation as uint16 and cast it back to uint8 only if all elements are less than 256, therefore having the same operation on arrays of the same type return an unpredictable data type? I think the answer is simple:

a = numpy.array([200])   if you want an integer
a = numpy.array([200.])  if you want a float

These are pretty safe for reasonable inputs. Whenever the user writes a = numpy.array([200], dtype=...), it means he/she knows what they are doing.

If instead you refer to

In [3]: numpy.add(a, a, numpy.empty((1, ), dtype = numpy.uint32))
Out[3]: array([144], dtype=uint32)

in this case I agree with you: the expected result should be 400. The inputs could be cast to the output type before performing the operation. I do not think performing the operations with the output dtype would break anything. Even in borderline cases like the following:
b = numpy.array([400], numpy.int16)
c = numpy.array([200], numpy.int16)
numpy.subtract(b.astype(numpy.int8), c.astype(numpy.int8), numpy.empty((1, ), dtype = numpy.int8))
array([100], dtype=int8)
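A minimal sketch of the cast-the-operands-first workaround, applied to the uint8 example from the start of the thread:

```python
import numpy as np

a = np.array([200], dtype=np.uint8)
out = np.empty((1,), dtype=np.uint32)

# Cast the operands to the output type first, so the ufunc loop runs
# in uint32 -- the "promote, then coerce" order discussed above.
np.add(a.astype(np.uint32), a.astype(np.uint32), out)
assert out[0] == 400  # instead of the wrapped 144
```

The cost is one widened temporary per operand, but the arithmetic is carried out at full precision before any coercion into the output array.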
Best, Luca
On Wednesday 22 July 2009 15:14:32 Citi, Luca wrote:
In [2]: a + a Out[2]: array([144], dtype=uint8)
Please do not "fix" this, that IS the correct output.
No, I did not mean to fix this. (Although it should be noted that in C/C++, the result of uint8+uint8 is int.)
If instead, you refer to

In [3]: numpy.add(a, a, numpy.empty((1, ), dtype = numpy.uint32))
Out[3]: array([144], dtype=uint32)

in this case I agree with you, the expected result should be 400.
Yes, that's what I meant.
The inputs could be cast to the output type before performing the operation. I do not think performing the operations with the output dtype would break anything. Even in borderline cases like the following:
b = numpy.array([400], numpy.int16)
c = numpy.array([200], numpy.int16)
numpy.subtract(b.astype(numpy.int8), c.astype(numpy.int8), numpy.empty((1, ), dtype = numpy.int8))
Indeed - I thought we had to take more care, but will this also work with int<->float conversions? No:

In [22]: a, b = numpy.array([[8.6,4.9]]).T
In [23]: numpy.subtract(a, b, numpy.empty((1, ), dtype = numpy.uint8))
Out[23]: array([3], dtype=uint8)
In [24]: numpy.subtract(a.astype(numpy.uint8), b.astype(numpy.uint8))
Out[24]: array([4], dtype=uint8)

However, I admit that this is a contrived example. ;-)

Greetings,
Hans
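The discrepancy above comes from where the float-to-int truncation happens; a small sketch of the two orders (plain arrays instead of the transposed one in the transcript):

```python
import numpy as np

a = np.array([8.6])
b = np.array([4.9])

# Compute in float64 first, then truncate the result: 3.7 -> 3
r1 = np.subtract(a, b).astype(np.uint8)

# Truncate the operands first, then compute: 8 - 4 -> 4
r2 = np.subtract(a.astype(np.uint8), b.astype(np.uint8))

assert r1[0] == 3
assert r2[0] == 4
```

So for int<->float mixes, "cast the inputs to the output type first" genuinely changes the answer, not just the wrap-around behaviour.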
Hello Hans,
Although it should be noted that in C/C++, the result of uint8+uint8 is int.

But C/C++ works with scalars, and temporary results are often kept in registers. On the contrary, numpy works with arrays. We cannot expect (a+b)*c to grow from uint8 to uint16 and then to uint32 :-D
For example, the conversion would typically have to be performed only once (after loading), no? Then, why not simplify things further by adding a dtype= parameter to importImage()? This could even default to float32 then.
I think this is the cleanest, easiest, and best way to go.

Also, Matlab's imread reads the image in its stored format (e.g. uint8). An explicit conversion (e.g. double()) is needed; otherwise all the operations are performed in uint8. Even

i = imread('...'); i = i + 34.5

does not trigger an upcast. The only difference is that Matlab saturates to 0 and 255 instead of wrapping.

Best,
Luca
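The wrap-vs-saturate difference is easy to reproduce in numpy; the clip-based saturating add below is just one way to emulate Matlab's behaviour, not anything built in:

```python
import numpy as np

i = np.array([200, 250], dtype=np.uint8)

# numpy wraps modulo 256: 300 -> 44, 350 -> 94
wrapped = i + np.uint8(100)

# Matlab-style saturation: compute in a wider type, clip, cast back
saturated = np.clip(i.astype(np.int16) + 100, 0, 255).astype(np.uint8)

assert wrapped.tolist() == [44, 94]
assert saturated.tolist() == [255, 255]
```

Either way, an explicit step (conversion or clipping) is needed; neither library silently upcasts the whole array for you.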
Hans Meine wrote:
In [3]: numpy.add(a, a, numpy.empty((1, ), dtype = numpy.uint32))
Out[3]: array([144], dtype=uint32)
yes, it sure would be nice to fix this...
one will often end up with uint8 arrays which cannot be passed into many algorithms without an explicit conversion. However, is this really a bad problem? For example, the conversion would typically have to be performed only once (after loading), no? Then, why not simplify things further by adding a dtype= parameter to importImage()? This could even default to float32 then.
For VIGRA specifically, this sounds like a fine way to go. However, for the broader numpy case: I want to add two uint8 arrays, and put the results into an int32 array (for instance). As pointed out, this doesn't work:

In [9]: a1 = np.array((200,), dtype= np.uint8)
In [10]: a2 = np.array((250,), dtype= np.uint8)
In [11]: a1 + a2
Out[11]: array([194], dtype=uint8)

As pointed out by others, this is the "right" behavior - we really don't want upcasting without asking for it. However, as above:

In [15]: np.add(a1, a2, np.empty(a1.shape, dtype = np.int32))
Out[15]: array([194])

I am asking for upcasting, so it's too bad I don't get it. The solution is to upcast ahead of time:

In [17]: np.add(a1.astype(np.int32), a2, np.empty(a1.shape, dtype = np.int32))
Out[17]: array([450])

or simply:

In [18]: a1.astype(np.int32) + a2
Out[18]: array([450])

Easy enough. The one downside is that an extra temporary in the larger type is needed (but only one). That could be an issue when working with large arrays, which I suspect is the case when these issues come up -- no one uses the np.add() notation unless they are trying to manage memory more carefully. Another way to write this is:

In [19]: a3 = a1.astype(np.int32)
In [20]: a3 += a2
In [21]: a3
Out[21]: array([450])

which I think avoids the extra temporary.

This is a pretty special case, and there are ways to accomplish what is needed. I suspect that's why no one has "fixed" this yet.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
2009/7/22 Hans Meine <meine@informatik.uni-hamburg.de>
In [3]: numpy.add(a, a, numpy.empty((1, ), dtype = numpy.uint32)) Out[3]: array([144], dtype=uint32)
Ideally, I'd like numpy to be "fixed" - I hope that the "only" obstacle is that someone needs to do it, but I am afraid of someone mentioning the term "backward-compatibility" (Ulli would surely call it "bug-compatibility" here ;-) ).
The rule here is that the output is computed in the common input type, then cast to the type of the output array. It will also be downcast if the output array is of lesser precision. This has been discussed on the list; my preference would have been to raise an error on mismatched types, but the consensus was to keep the current casting behaviour.

What you are asking for is that the ufunc upcast the arrays before the addition if the output array is of higher precision. This complicates the logic. I think it is better in this case to explicitly make a of higher precision before the call. Note that the ufunc loops don't generally handle mixed types, with the exception of the logical functions and, IIRC, the exponent/mantissa functions.

Chuck
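The compute-then-cast rule, side by side with the ufunc `dtype=` argument, which asks the loop itself to run in a wider type; the exact behaviour may vary across NumPy versions, but this is how reasonably recent releases behave:

```python
import numpy as np

a = np.array([200], dtype=np.uint8)

# The rule: the loop runs in the common input type (uint8), and only
# the already-wrapped result is cast into the wider output array.
out = np.empty((1,), dtype=np.uint32)
np.add(a, a, out)
assert out[0] == 144

# Requesting a wider loop type explicitly avoids the wrap-around:
assert np.add(a, a, dtype=np.uint32)[0] == 400
```

In other words, the output array's dtype never influences the loop selection; only the inputs (or an explicit `dtype=`/upcast of them) do.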
participants (4)

- Charles R Harris
- Christopher Barker
- Citi, Luca
- Hans Meine