[Numpy-discussion] Do we want scalar casting to behave as it does at the moment?

Nathaniel Smith njs at pobox.com
Wed Jan 9 10:09:21 EST 2013


On Tue, Jan 8, 2013 at 9:14 PM, Andrew Collette
<andrew.collette at gmail.com> wrote:
> Hi Nathaniel,
>
> (Responding to both your emails)
>
>> The problem is that the rule for arrays - and for every other part of
>> numpy in general - is that we *don't* pick types based on values.
>> Numpy always uses input types to determine output types, not input
>> values.
>
> Yes, of course... array operations are governed exclusively by their
> dtypes.  It seems to me that, using the language of the bug report
> (2878), if we have this:
>
> result = arr + scalar
>
> I would argue that our job is, rather than to pick result.dtype, to
> pick scalar.dtype, and apply the normal rules for array operations.

Okay, but we already have unambiguous rules for picking scalar.dtype:
you use whatever width the underlying type has, so it'd always be
np.int_ or np.float64. Those are the normal rules for picking dtypes.
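
For concreteness, here's roughly what those rules look like if you ask
numpy directly (a sketch, with the usual import numpy as np; the
output assumes a platform where np.int_ is 64 bits):

>>> np.asarray(1).dtype                   # a bare Python int gets the full native width
dtype('int64')
>>> np.promote_types(np.int8, np.int_)    # so the array rules would upcast
dtype('int64')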

I'm just trying to make clear that what you're arguing for is also a
very special case, which also violates the rules numpy uses everywhere
else. That doesn't mean we should rule it out ("Special cases aren't
special enough to break the rules. / Although practicality beats
purity."), but claiming that it is just "the normal rules" while
everything else is a "special case" is rhetorically unhelpful.

>> So it's pretty unambiguous that
>> "using the same rules for arrays and scalars" would mean, ignore the
>> value of the scalar, and in expressions like
>>   np.array([1], dtype=np.int8) + 1
>> we should always upcast to int32/int64.
>
> Ah, but that's my point: we already, in 1.6, ignore the intrinsic
> width of the scalar and effectively substitute one based on its
> value:
>
>>>> a = np.array([1], dtype=np.int8)
>>>> (a + 1).dtype
> dtype('int8')
>>>> (a + 1000).dtype
> dtype('int16')
>>>> (a + 90000).dtype
> dtype('int32')
>>>> (a + 2**40).dtype
> dtype('int64')

Sure. But the only reason this is in 1.6 is that the person who made
the change never mentioned it to anyone else, so it wasn't noticed
until after 1.6 came out. If it had gone through proper review/mailing
list discussion (like we're doing now) then it's very unlikely it
would have gone in in its present form.

>> 1.6, your proposal: in a binary operation, if one operand has ndim==0
>> and the other has ndim>0, downcast the ndim==0 item to the smallest
>> width that is consistent with its value and the other operand's type.
>
> Yes, exactly.  I'm not trying to propose a completely new behavior: as
> I mentioned (although very far upthread), this is the mental model I
> had of how things worked in 1.6 already.
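
Aside: on a 1.6-style numpy you can see what that rule amounts to
through np.result_type, which applies the same value-based promotion
(np.min_scalar_type exposes roughly the same smallest-width-for-a-value
logic):

>>> np.result_type(np.int8, 127)     # value fits in int8, so the array's dtype wins
dtype('int8')
>>> np.result_type(np.int8, 1000)    # needs at least 16 bits
dtype('int16')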
>
>> New users don't use narrow-width dtypes... it's important to remember
>> in this discussion that in numpy, non-standard dtypes only arise when
>> users explicitly request them, so there's some expressed intention
>> there that we want to try and respect.
>
> I would respectfully disagree.  One example I cited was that when
> dealing with HDF5, it's very common to get int16's (and even int8's)
> when reading from a file because they are used to save disk space.
> All a new user has to do to get int8's from a file they got from
> someone else is:
>
>>>> data = some_hdf5_file['MyDataset'][...]
>
> This is a general issue applying to data which is read from real-world
> external sources.  For example, digitizers routinely represent their
> samples as int8's or int16's, and you apply a scale and offset to get
> a reading in volts.

This particular case is actually handled fine by 1.5, because int
array + float scalar *does* upcast to float. It's width that's ignored
(int8 versus int32), not the basic "kind" of data (int versus float).
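
E.g., taking your digitizer example (a sketch with made-up numbers;
the exact float width aside, the point is that the kind changes):

>>> raw = np.array([12, -7, 100], dtype=np.int16)   # digitizer counts
>>> volts = raw * 0.125 + 0.5                       # float scale/offset
>>> volts.dtype.kind                                # result is some float, not int16
'f'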

But overall this does sound like a problem -- it's just not a problem
with the scalar/array rules; it's a problem with working with
narrow-width data in general. There's a good argument to be made that
data files should be stored in compressed form but expanded to full
width when read in, exactly to avoid the problems that arise when
trying to manipulate narrow-width representations.
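
Concretely, reusing your hypothetical h5py names from above, the fix
at read time is a one-liner (a sketch):

>>> raw = some_hdf5_file['MyDataset'][...]    # int16 on disk, to save space
>>> data = raw.astype(np.float64)             # widen once, right after reading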

Suppose your scale and offset *were* integers, so that the "kind"
casting rules didn't get invoked. Even then, the rules you're arguing
for would not actually solve your problem. It'd be very easy to have,
say, scale=100, offset=100, both of which fit fine in an int8... but
actually performing the scaling/offsetting in an int8 would still be a
terrible idea! The problem you're talking about is picking the correct
width for an *operation*, and futzing about with picking the dtype of
*one input* to that operation is not going to help; it's like trying
to ensure your house won't fall down by making sure the doors are
really sturdy.
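
To make that concrete (a sketch of the failure mode, with made-up
numbers):

>>> scale = np.int8(100)                    # "fits fine in an int8"
>>> counts = np.array([100], dtype=np.int8)
>>> counts * scale                          # 100 * 100 = 10000 overflows int8
array([16], dtype=int8)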

-n
