[Numpy-discussion] Do we want scalar casting to behave as it does at the moment?

Matthew Brett matthew.brett at gmail.com
Thu Jan 17 09:26:16 EST 2013


Hi,

On Wed, Jan 9, 2013 at 6:07 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 01/09/2013 06:22 PM, Chris Barker - NOAA Federal wrote:
>> On Wed, Jan 9, 2013 at 7:09 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>>> This is a general issue applying to data which is read from real-world
>>>> external sources.  For example, digitizers routinely represent their
>>>> samples as int8's or int16's, and you apply a scale and offset to get
>>>> a reading in volts.
>>>
>>> This particular case is actually handled fine by 1.5, because int
>>> array + float scalar *does* upcast to float. It's width that's ignored
>>> (int8 versus int32), not the basic "kind" of data (int versus float).
>>>
>>> But overall this does sound like a problem -- but it's not a problem
>>> with the scalar/array rules, it's a problem with working with narrow
>>> width data in general.
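
To make Nathaniel's kind-versus-width point concrete, here is roughly
what I mean (the exact widths below are from my reading of the current
1.6-ish rules, and 1.5 differed on some of them):

    import numpy as np

    a = np.zeros(3, dtype=np.int8)

    # "kind" is respected: adding a Python float upcasts the integer
    # array to some float dtype (the exact width depends on the rules
    # in force in your numpy version)
    print((a + 1.5).dtype)

    # width is where the value-based behaviour kicks in: a small scalar
    # leaves the array dtype alone, a large one may widen it
    print((a + 1).dtype)      # int8
    print((a + 1000).dtype)   # widens (to int16, I believe), since
                              # 1000 does not fit in int8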
>>
>> Exactly -- this is key. Details aside, we essentially have a choice
>> between an approach that makes it easy to preserve your values --
>> upcasting liberally -- and one that makes it easy to preserve your
>> dtype -- requiring users to explicitly upcast where needed.
>>
>> IIRC, our experience with earlier versions of numpy (and Numeric
>> before that) is that all too often folks would choose a small dtype
>> quite deliberately, then have it accidentally upcast for them -- this
>> was determined to be not-so-good behavior.
>>
>> I think the HDF (and also netcdf...) case is a special case -- the
>> small dtype+scaling has been chosen deliberately by whoever created
>> the data file (to save space), but we would want it generally opaque
>> to the consumer of the file -- to me, that means the issue should be
>> addressed by the file reading tools, not numpy. If your HDF5 reader
>> chooses the resulting dtype explicitly, it doesn't matter what
>> numpy's defaults are. If the user wants to work with the raw, unscaled
>> arrays, then they should know what they are doing.
>
> +1. I think h5py should consider:
>
> File("my.h5")['int8_dset'].dtype == int64
> File("my.h5", preserve_dtype=True)['int8_dset'].dtype == int8

Returning to this thread - did we have a decision?

With further reflection, it seems to me we will have a tough time
going back to the 1.5 behavior now - we might be shutting the stable
door after the cat is out of the bag, if you see what I mean.

Maybe we should change the question to: what is the desirable behavior
in the long term?

I am starting to wonder if we should aim for making:

* scalar and array casting rules the same;
* Python int / float scalars become int32 / int64 or float64.

This has the benefit of being very easy to understand and explain.  It
makes dtypes predictable, in the sense that they don't depend on values.

Those wanting to maintain - say - float32 will need to cast scalars to float32.
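
Something like this, under the rule sketched above (np.float32 is
ordinary numpy; the only assumption is that a bare Python float would
otherwise be treated as float64):

    import numpy as np

    a = np.arange(5, dtype=np.float32)

    # A bare Python float scalar would be treated as float64 and upcast
    # the result; wrapping it keeps everything in float32, at the cost
    # of a little verbosity.
    b = a * np.float32(2.5)
    print(b.dtype)   # float32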

Maybe the use-cases motivating the scalar casting rules - maintaining
float32 precision in particular - can be dealt with by careful casting
of scalars, throwing the burden onto the memory-conscious to maintain
their dtypes.

Or is there a way of using flags to ufuncs to emulate the 1.5 casting rules?
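
The closest thing I can see in the current machinery is pinning the
dtype per call -- ufuncs do take dtype= and casting= keyword arguments
-- but that is a per-operation workaround rather than a switch back to
the 1.5 rules:

    import numpy as np

    a = np.zeros(3, dtype=np.float32)

    # Force the computation and result dtype for this one call,
    # whatever the promotion rules would otherwise say.
    b = np.add(a, 2.5, dtype=np.float32)
    print(b.dtype)   # float32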

Do y'all agree this is desirable in the long term?

If so, how should we get there?  It seems to me we're about 25 percent
of the way there with the current scalar casting rule.

Cheers,

Matthew


