[Numpy-discussion] Scalar casting rules use-case reprise

Olivier Delalleau shish at keba.be
Fri Jan 4 09:34:29 EST 2013


2013/1/4 Nathaniel Smith <njs at pobox.com>:
> On Fri, Jan 4, 2013 at 11:09 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
>> Hi,
>>
>> Reading the discussion on the scalar casting rule change I realized I
>> was hazy on the use-cases that led to the rule that scalars cast
>> differently from arrays.
>>
>> My impression was that the primary use-case was for lower-precision
>> floats. That is, when you have a large float32 arr, you do not want to
>> double your memory use with:
>>
>>>>> large_float32 + 1.0 # please no float64 here
>>
>> Probably also:
>>
>>>>> large_int8 + 1 # please no int32 / int64 here.
>>
>> That makes sense.  On the other hand, these are more ambiguous:
>>
>>>>> large_float32 + np.float64(1) # really - you don't want float64?
>>
>>>>> large_int8 + np.int32(1) # ditto
>>
>> I wonder whether the main use-case was to deal with the automatic
>> types of Python floats and ints?  That is, I wonder whether it would
>> be worth considering (in the distant long term) doing fancy
>> guess-what-you-mean stuff with Python scalars, on the basis that they
>> are of unspecified dtype, and making 0-dimensional (numpy) scalars
>> follow the array casting rules.  As in:
>>
>>>>> large_float32 + 1.0
>> # no upcast - we don't know what float type you meant for the scalar
>>>>> large_float32 + np.float64(1)
>> # upcast - you clearly meant the scalar to be float64
>
> Hmm, but consider this, which is exactly the operation in your example:
>
> In [9]: a = np.arange(3, dtype=np.float32)
>
> In [10]: a / np.mean(a) # normalize
> Out[10]: array([ 0.,  1.,  2.], dtype=float32)
>
> In [11]: type(np.mean(a))
> Out[11]: numpy.float64
>
> Obviously the most common situation where it's useful to have the rule
> to ignore scalar width is for avoiding "width contamination" from
> Python float and int literals. But you can easily end up with numpy
> scalars from indexing, high-precision operations like np.mean, etc.,
> where you don't "really mean" that you want high precision. And at least
> it's easy to understand the rule: same-kind scalars don't affect
> precision.
>
> ...Though arguably the bug here is that np.mean actually returns a
> value with higher precision. Interestingly, we seem to have some
> special cases so that if you want to normalize each row of a matrix,
> then again the dtype is preserved, but for a totally different
> reason. In
>
> a = np.arange(4, dtype=np.float32).reshape((2, 2))
> a / np.mean(a, axis=0, keepdims=True)
>
> the result has float32 type, even though this is an array/array
> operation, not an array/scalar operation. The reason is:
>
> In [32]: np.mean(a).dtype
> Out[32]: dtype('float64')
>
> But:
>
> In [33]: np.mean(a, axis=0).dtype
> Out[33]: dtype('float32')
>
> In this respect np.var and np.std behave like np.mean, but np.sum
> always preserves the input dtype. (Which is curious because np.sum is
> just like np.mean in terms of potential loss of precision, right? The
> problem in np.mean is the accumulating error over many addition
> operations, not the divide-by-n at the end.)

IMO, having a different dtype depending on whether or not you provide
the "axis" argument to mean() should be considered a bug.
As to what the correct dtype should be... it's not such an easy
question. Personally I would go with float64 by default, to be
consistent across all int / float dtypes. Then someone who wants to
downcast it can use the "out" argument to mean().
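For instance, something like this (just a sketch of what I have in
mind; mean() also takes a "dtype" argument that serves the same
purpose):

>>> a = np.arange(4, dtype=np.float32).reshape((2, 2))
>>> out = np.empty(2, dtype=np.float32)
>>> np.mean(a, axis=0, out=out)        # result stored in a float32 array
array([ 1.,  2.], dtype=float32)
>>> np.mean(a, dtype=np.float32)       # or ask for a float32 accumulator
1.5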

To come back to Matthew's use-case question, I agree the most common
use case is to prevent a float32 or small-int array from being
upcast, and most of the time the scalar involved would be a Python
scalar. However, I don't think it's a good idea to have a behavior
that differs between Python and numpy scalars, because it's a subtle
difference that users could have trouble understanding and foreseeing.
The expected behavior of numpy functions, when given non-numpy
objects, is to behave as if numpy.asarray() had been called on those
objects first, and straying from that behavior seems dangerous to me.
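To illustrate what I mean by asarray() semantics: a Python float and a
numpy float64 scalar are indistinguishable once converted, so giving
them different casting behavior would be surprising. Roughly:

>>> np.asarray(1.0).dtype
dtype('float64')
>>> np.asarray(np.float64(1.0)).dtype
dtype('float64')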

As far as I'm concerned, if numpy were brand new with no existing
codebase using it, I would probably prefer to use the same casting
rules for array/array and array/scalar operations. That may cause
some unwanted array upcasting, but it's a lot simpler to understand.
However, given that there may be a lot of code relying on the current
dtype-preserving behavior, making that change now doesn't sound like a
good idea to me.
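Concretely, the array/scalar vs. array/array difference I'm referring
to looks like this under the current rules (a sketch, with a 1-element
float64 array standing in for the array case):

>>> a = np.arange(3, dtype=np.float32)
>>> (a + np.float64(1)).dtype          # array/scalar: dtype preserved
dtype('float32')
>>> (a + np.array([1.0])).dtype        # array/array: upcast to float64
dtype('float64')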

-=- Olivier


