Scalar casting rules use-case reprise
Hi,

Reading the discussion on the scalar casting rule change, I realized I was hazy on the use-cases that led to the rule that scalars cast differently from arrays.

My impression was that the primary use-case was for lower-precision floats. That is, when you have a large float32 arr, you do not want to double your memory use with:
large_float32 + 1.0 # please no float64 here
Probably also:
large_int8 + 1 # please no int32 / int64 here.
That makes sense. On the other hand, these are more ambiguous:
large_float32 + np.float64(1) # really - you don't want float64?
large_int8 + np.int32(1) # ditto
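For concreteness, here is what the current rules give for these four expressions -- a quick check assuming the value-based casting behaviour current at the time of this thread; results may differ under other numpy versions:

>>> import numpy as np
>>> large_float32 = np.zeros(1000, dtype=np.float32)
>>> (large_float32 + 1.0).dtype             # Python float: no upcast
dtype('float32')
>>> (large_float32 + np.float64(1)).dtype   # same-kind numpy scalar: also no upcast
dtype('float32')
>>> large_int8 = np.zeros(1000, dtype=np.int8)
>>> (large_int8 + 1).dtype                  # Python int whose value fits in int8: no upcast
dtype('int8')
>>> (large_int8 + np.int32(1)).dtype        # same-kind numpy scalar: also no upcast
dtype('int8')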
I wonder whether the main use-case was to deal with the automatic types of Python float and int scalars? That is, I wonder whether it would be worth considering (in the distant long term) doing fancy guess-what-you-mean stuff with Python scalars, on the basis that they are of unspecified dtype, and making 0-dimensional scalars follow the array casting rules. As in:
large_float32 + 1.0           # no upcast - we don't know what float type you meant for the scalar
large_float32 + np.float64(1) # upcast - you clearly meant the scalar to be float64
In any case, can anyone remember the original use-cases well enough to record them for future decision making?

Best,

Matthew
On Fri, Jan 4, 2013 at 11:09 AM, Matthew Brett wrote:
> Hi,
>
> Reading the discussion on the scalar casting rule change I realized I was hazy on the use-cases that led to the rule that scalars cast differently from arrays.
>
> My impression was that the primary use-case was for lower-precision floats. That is, when you have a large float32 arr, you do not want to double your memory use with:
>
> large_float32 + 1.0 # please no float64 here
>
> Probably also:
>
> large_int8 + 1 # please no int32 / int64 here.
>
> That makes sense. On the other hand these are more ambiguous:
>
> large_float32 + np.float64(1) # really - you don't want float64?
> large_int8 + np.int32(1) # ditto
>
> I wonder whether the main use-case was to deal with the automatic types of Python floats and scalars? That is, I wonder whether it would be worth considering (in the distant long term), doing fancy guess-what-you-mean stuff with Python scalars, on the basis that they are of unspecified dtype, and make 0 dimensional scalars follow the array casting rules. As in:
>
> large_float32 + 1.0           # no upcast - we don't know what float type you meant for the scalar
> large_float32 + np.float64(1) # upcast - you clearly meant the scalar to be float64
Hmm, but consider this, which is exactly the operation in your example:

In [9]: a = np.arange(3, dtype=np.float32)

In [10]: a / np.mean(a)  # normalize
Out[10]: array([ 0., 1., 2.], dtype=float32)

In [11]: type(np.mean(a))
Out[11]: numpy.float64

Obviously the most common situation where it's useful to have the rule to ignore scalar width is for avoiding "width contamination" from Python float and int literals. But you can easily end up with numpy scalars from indexing, high-precision operations like np.mean, etc., where you don't "really mean" that you want high precision. And at least the rule is easy to understand: same-kind scalars don't affect precision.

...Though arguably the bug here is that np.mean actually returns a value with higher precision. Interestingly, we seem to have some special cases so that if you want to normalize each row of a matrix, then again the dtype is preserved, but for totally different reasons. In

    a = np.arange(4, dtype=np.float32).reshape((2, 2))
    a / np.mean(a, axis=0, keepdims=True)

the result has float32 type, even though this is an array/array operation, not an array/scalar operation. The reason is:

In [32]: np.mean(a).dtype
Out[32]: dtype('float64')

But:

In [33]: np.mean(a, axis=0).dtype
Out[33]: dtype('float32')

In this respect np.var and np.std behave like np.mean, but np.sum always preserves the input dtype. (Which is curious, because np.sum is just like np.mean in terms of potential loss of precision, right? The problem in np.mean is the accumulating error over many addition operations, not the divide-by-n at the end.)

It is very disturbing that even after this discussion none of us here seem to actually have a precise understanding of how the numpy type selection system actually works :-(. We really need a formal description...

-n
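One way to take the scalar casting rules out of the picture for this particular normalization example is to pin the reduction dtype explicitly. A minimal sketch, assuming only the documented dtype argument of np.mean (exact output formatting may differ between numpy versions):

In [1]: a = np.arange(4, dtype=np.float32).reshape((2, 2))

In [2]: (a / np.mean(a, axis=0, keepdims=True, dtype=np.float32)).dtype   # per-column mean, dtype pinned
Out[2]: dtype('float32')

In [3]: (a / np.float32(np.mean(a))).dtype   # full reduction: cast the scalar back down by hand
Out[3]: dtype('float32')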
On 01/04/2013 02:46 PM, Nathaniel Smith wrote:
> On Fri, Jan 4, 2013 at 11:09 AM, Matthew Brett wrote:
>> Hi,
>>
>> Reading the discussion on the scalar casting rule change I realized I was hazy on the use-cases that led to the rule that scalars cast differently from arrays.
>>
>> My impression was that the primary use-case was for lower-precision floats. That is, when you have a large float32 arr, you do not want to double your memory use with:
>>
>> large_float32 + 1.0 # please no float64 here
>>
>> Probably also:
>>
>> large_int8 + 1 # please no int32 / int64 here.
>>
>> That makes sense. On the other hand these are more ambiguous:
>>
>> large_float32 + np.float64(1) # really - you don't want float64?
>> large_int8 + np.int32(1) # ditto
>>
>> I wonder whether the main use-case was to deal with the automatic types of Python floats and scalars? That is, I wonder whether it would be worth considering (in the distant long term), doing fancy guess-what-you-mean stuff with Python scalars, on the basis that they are of unspecified dtype, and make 0 dimensional scalars follow the array casting rules. As in:
>>
>> large_float32 + 1.0           # no upcast - we don't know what float type you meant for the scalar
>> large_float32 + np.float64(1) # upcast - you clearly meant the scalar to be float64
>
> Hmm, but consider this, which is exactly the operation in your example:
>
> In [9]: a = np.arange(3, dtype=np.float32)
>
> In [10]: a / np.mean(a)  # normalize
> Out[10]: array([ 0., 1., 2.], dtype=float32)
>
> In [11]: type(np.mean(a))
> Out[11]: numpy.float64
>
> Obviously the most common situation where it's useful to have the rule to ignore scalar width is for avoiding "width contamination" from Python float and int literals. But you can easily end up with numpy scalars from indexing, high-precision operations like np.mean, etc., where you don't "really mean" you want high-precision. And at least it's easy to understand the rule: same-kind scalars don't affect precision.
>
> ...Though arguably the bug here is that np.mean actually returns a value with higher precision. Interestingly, we seem to have some special cases so that if you want to normalize each row of a matrix, then again the dtype is preserved, but for a totally different reasons. In
>
>     a = np.arange(4, dtype=np.float32).reshape((2, 2))
>     a / np.mean(a, axis=0, keepdims=True)
>
> the result has float32 type, even though this is an array/array operation, not an array/scalar operation. The reason is:
>
> In [32]: np.mean(a).dtype
> Out[32]: dtype('float64')
>
> But:
>
> In [33]: np.mean(a, axis=0).dtype
> Out[33]: dtype('float32')
>
> In this respect np.var and np.std behave like np.mean, but np.sum always preserves the input dtype. (Which is curious because np.sum is just like np.mean in terms of potential loss of precision, right? The problem in np.mean is the accumulating error over many addition operations, not the divide-by-n at the end.)
>
> It is very disturbing that even after this discussion none of us here seem to actually have a precise understanding of how the numpy type selection system actually works :-(. We really need a formal description...
I think this is a usability wart -- if you don't understand it, then newcomers certainly don't.

Very naive question: if one is re-doing this anyway, how important are the primitive (non-record) NumPy scalars at all? How much would break if one simply always used Python's int and float, and declared that scalars never interact with the dtype? That is:

a) any computation returning a scalar can return float()/int()
b) float() values are silently truncated to float32 where needed
c) integral values that don't fit either wrap around, truncate, or raise an error
d) the only thing that determines dtype is the dtypes of arrays, never scalars

Too naive? I guess the opposite idea is what Travis mentioned in his passing-the-torch post, about making scalars and 0-d arrays the same.

Dag Sverre
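For context on what the primitive numpy scalar types currently carry that a plain Python float would not -- a quick illustrative check, not specific to any of the options above (exact reprs may differ between versions):

In [1]: a = np.arange(3, dtype=np.float32)

In [2]: s = a[0]             # indexing currently returns a numpy scalar, not a Python float

In [3]: type(s), s.dtype, s.itemsize
Out[3]: (numpy.float32, dtype('float32'), 4)

In [4]: float(s)             # under option (a) above, this Python float is all you would get back
Out[4]: 0.0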
2013/1/4 Nathaniel Smith
> On Fri, Jan 4, 2013 at 11:09 AM, Matthew Brett wrote:
>> Hi,
>>
>> Reading the discussion on the scalar casting rule change I realized I was hazy on the use-cases that led to the rule that scalars cast differently from arrays.
>>
>> My impression was that the primary use-case was for lower-precision floats. That is, when you have a large float32 arr, you do not want to double your memory use with:
>>
>> large_float32 + 1.0 # please no float64 here
>>
>> Probably also:
>>
>> large_int8 + 1 # please no int32 / int64 here.
>>
>> That makes sense. On the other hand these are more ambiguous:
>>
>> large_float32 + np.float64(1) # really - you don't want float64?
>> large_int8 + np.int32(1) # ditto
>>
>> I wonder whether the main use-case was to deal with the automatic types of Python floats and scalars? That is, I wonder whether it would be worth considering (in the distant long term), doing fancy guess-what-you-mean stuff with Python scalars, on the basis that they are of unspecified dtype, and make 0 dimensional scalars follow the array casting rules. As in:
>>
>> large_float32 + 1.0           # no upcast - we don't know what float type you meant for the scalar
>> large_float32 + np.float64(1) # upcast - you clearly meant the scalar to be float64
>
> Hmm, but consider this, which is exactly the operation in your example:
>
> In [9]: a = np.arange(3, dtype=np.float32)
>
> In [10]: a / np.mean(a)  # normalize
> Out[10]: array([ 0., 1., 2.], dtype=float32)
>
> In [11]: type(np.mean(a))
> Out[11]: numpy.float64
>
> Obviously the most common situation where it's useful to have the rule to ignore scalar width is for avoiding "width contamination" from Python float and int literals. But you can easily end up with numpy scalars from indexing, high-precision operations like np.mean, etc., where you don't "really mean" you want high-precision. And at least it's easy to understand the rule: same-kind scalars don't affect precision.
>
> ...Though arguably the bug here is that np.mean actually returns a value with higher precision. Interestingly, we seem to have some special cases so that if you want to normalize each row of a matrix, then again the dtype is preserved, but for a totally different reasons. In
>
>     a = np.arange(4, dtype=np.float32).reshape((2, 2))
>     a / np.mean(a, axis=0, keepdims=True)
>
> the result has float32 type, even though this is an array/array operation, not an array/scalar operation. The reason is:
>
> In [32]: np.mean(a).dtype
> Out[32]: dtype('float64')
>
> But:
>
> In [33]: np.mean(a, axis=0).dtype
> Out[33]: dtype('float32')
>
> In this respect np.var and np.std behave like np.mean, but np.sum always preserves the input dtype. (Which is curious because np.sum is just like np.mean in terms of potential loss of precision, right? The problem in np.mean is the accumulating error over many addition operations, not the divide-by-n at the end.)
IMO, having a different dtype depending on whether or not you provide the "axis" argument to mean() should be considered a bug. As to what the correct dtype should be... it's not such an easy question. Personally, I would go with float64 by default, to be consistent across all int / float dtypes. Then someone who wants to downcast can use the "out" argument to mean(), as in the sketch below.

To come back to Matthew's use-case question, I agree the most common use case is to prevent a float32 or small int array from being upcast, and most of the time the scalar would come from Python literals. However, I don't think it's a good idea to have a behavior that differs between Python and NumPy scalars, because it's a subtle difference that users could have trouble understanding and foreseeing. The expected behavior of numpy functions, when you pass them non-numpy objects, is that they behave the same as if we had called numpy.asarray() on those objects, and straying from this seems dangerous to me.

As far as I'm concerned, in a world where numpy were brand new with no existing codebase using it, I would probably prefer to use the same casting rules for array/array and array/scalar operations. It may cause some unwanted array upcasting, but it's a lot simpler to understand. However, given that there may be a lot of code relying on the current dtype-preserving behavior, changing it now doesn't sound like a good idea to me.

-=- Olivier
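A minimal sketch of the two points above -- forcing the reduction dtype via out= (or dtype=), and letting numpy.asarray() say what a scalar "is" -- using only documented arguments of np.mean and np.asarray (exact reprs may vary between versions):

In [1]: a = np.arange(4, dtype=np.float32)

In [2]: out = np.empty((), dtype=np.float32)

In [3]: np.mean(a, out=out).dtype         # result stays float32 regardless of the default
Out[3]: dtype('float32')

In [4]: np.mean(a, dtype=np.float32).dtype
Out[4]: dtype('float32')

In [5]: np.asarray(1.0).dtype             # what asarray() says a Python float "is"
Out[5]: dtype('float64')

In [6]: np.asarray(np.float32(1)).dtype   # a numpy scalar keeps its own dtype
Out[6]: dtype('float32')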
On Fri, Jan 4, 2013 at 11:09 AM, Matthew Brett wrote:
> In any case, can anyone remember the original use-cases well enough to record them for future decision making?
Heh. Everything new is old again. Here's a discussion from 2002 which quotes the rationale:

http://mail.scipy.org/pipermail/numpy-discussion/2002-September/014002.html

Note that in context:

- "numpy" means the old Numeric library
- AFAICT neither Numeric nor numarray had special "scalar" types at this point, and they didn't have 0-d arrays either, so in fact indexing an array would just return the closest Python type (int or float). In fact this is a thread about the problems this causes. (So the question Dag raised downthread was prescient! Or, well, postscient, I guess.)

So it looks like the main reason was actually that back then, you *couldn't* preserve non-native widths in operations involving scalars, because there was no such thing as a non-native-width scalar. As soon as you called 'sum' or indexed an array, you reverted to native width.

-n
participants (4):

- Dag Sverre Seljebotn
- Matthew Brett
- Nathaniel Smith
- Olivier Delalleau