[Numpy-discussion] Moving forward with value based casting

Sebastian Berg sebastian at sipsolutions.net
Fri Jun 7 14:50:37 EDT 2019


On Thu, 2019-06-06 at 19:34 -0400, Allan Haldane wrote:
> On 6/6/19 12:46 PM, Sebastian Berg wrote:
> > On Thu, 2019-06-06 at 11:57 -0400, Allan Haldane wrote:
> > > I think dtype-based casting makes a lot of sense, the problem is
> > > backward compatibility.
> > > 
> > > Numpy casting is weird in a number of ways: The array + array
> > > casting is unexpected to many users (eg, uint64 + int64 ->
> > > float64), and the casting of array + scalar is different from
> > > that, and value based.
> > > Personally I wouldn't want to try to change it unless we make a
> > > backward-incompatible release (numpy 2.0), based on my experience
> > > trying to change much more minor things. We already put "casting"
> > > on the list of desired backward-incompatible changes here:
> > > https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release
> > > 
> > > Relatedly, I've previously dreamed about a different "C-style"
> > > way casting might behave:
> > > https://gist.github.com/ahaldane/0f5ade49730e1a5d16ff6df4303f2e76
> > > 
> > > The proposal there is that array + array, array + scalar, and
> > > array + python casting would all work in the same dtype-based
> > > way, which mimics the familiar "C" casting rules.
> > 
> > If I read it right, you do propose that array + python would cast
> > in a "minimal type" way for python.
> 
> I'm a little unclear what you mean by "minimal type" way. By
> "minimal type", I thought you and others were talking about the rule
> numpy currently uses that "the output dtype is the minimal dtype
> capable of representing the value of both input dtypes", right? But
> in that gist I am instead proposing that output-dtype is determined
> by C-like rules.
> 
> For array+py_scalar I was less certain what to do than for
> array+array and array+npy_scalar. But I proposed the three "ranks"
> of 1. bool, 2. int, and 3. float/complex. My rule for array+py_scalar
> is that if the python scalar's rank is less than or equal to the
> numpy operand dtype's rank, use the numpy dtype. If the
> python-scalar's rank is greater, use the "default" types of bool_,
> int64, float64 respectively. Eg:
> 
> np.bool_(1) + 1        -> int64   (default int wins)
> np.int8(1) + 1         -> int8    (numpy wins)
> np.uint8(1) + (-1)     -> uint8   (numpy wins)
> np.int64(1) + 1        -> int64   (numpy wins)
> np.int64(1) + 1.0      -> float64 (default float wins)
> np.float32(1.0) + 1.0  -> float32 (numpy wins)
> 
> Note it does not depend on the numerical value of the scalar, only
> its type.
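The rank rule above could be sketched as follows (a hypothetical illustration; `PY_RANK`, `np_rank`, and `promote_with_py_scalar` are made-up helper names, not numpy functions):

```python
import numpy as np

# Hypothetical sketch of the proposed rank rule: python scalars carry
# only a rank, not a precision. Complex would rank with float.
PY_RANK = {bool: 1, int: 2, float: 3}
DEFAULT = {1: np.dtype(np.bool_), 2: np.dtype(np.int64), 3: np.dtype(np.float64)}

def np_rank(dtype):
    # bool -> 1, signed/unsigned ints -> 2, float/complex -> 3
    if dtype.kind == "b":
        return 1
    if dtype.kind in "iu":
        return 2
    return 3

def promote_with_py_scalar(np_dtype, py_scalar):
    # The numpy dtype wins at equal or higher rank; otherwise the
    # default dtype for the python scalar's rank wins.
    rank = PY_RANK[type(py_scalar)]
    if rank <= np_rank(np_dtype):
        return np_dtype
    return DEFAULT[rank]

assert promote_with_py_scalar(np.dtype(np.bool_), 1) == np.int64    # default int wins
assert promote_with_py_scalar(np.dtype(np.int8), 1) == np.int8      # numpy wins
assert promote_with_py_scalar(np.dtype(np.int64), 1.0) == np.float64
assert promote_with_py_scalar(np.dtype(np.float32), 1.0) == np.float32
```

Note how the sketch never inspects the scalar's value, only its python type.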
> 
> > In your write-up, you describe that if you mix array + scalar, the
> > scalar uses a minimal dtype compared to the array's dtype.
> 
> Sorry if I'm nitpicking/misunderstanding, but in my rules
> np.uint64(1) + 1 -> uint64, but in numpy's "minimal dtype" rules it
> is -> float64. So I don't think I am using the minimal rule.
> 
> > What we instead have is that in principle you could have loops
> > such as:
> > 
> > "ifi->f"
> > "idi->d"
> > 
> > and I think we should choose the first for a scalar, because it
> > "fits" into f just fine (if the input is `ufunc(int_arr, 12.,
> > int_arr)`).
> 
> I feel I'm not understanding you, but the casting rules in my gist
> follow those two rules if i, f are the numpy types int32 and float32.
> 
> If instead you mean (np.int64, py_float, np.int64), my rules would
> cast to float64, since py_float has the highest rank and so is
> converted to the default numpy-type for that rank, float64.

Yes, you are right. I should look at them a bit more carefully in any
case. Actually, numpy would also choose the second one, because the
python float has the higher "category". The example should rather have
been:

int8, float32 -> float32
int64, float32 -> float64

With `python_int(12) + np.array([1., 2.], dtype=float64)`, numpy would
currently choose the int8 loop here, because the scalar is of a lower
or equal "category" and thus it is OK to demote it even further.
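The two loop signatures above can be checked with `np.result_type` when both operands have concrete dtypes (a minimal check; the python-scalar side depends on the numpy version, so it is left out):

```python
import numpy as np

# Dtype-based promotion, matching the two signatures above; no values
# are involved, so this is stable across numpy versions.
assert np.result_type(np.int8, np.float32) == np.float32
assert np.result_type(np.int64, np.float32) == np.float64
```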

This is fairly irrelevant for most users, but for ufunc dispatching
with non-uniform dtype signatures it is where things get ugly. And no,
I doubt that this is very relevant in practice or that numpy is even
very consistent here.

I have a branch now which basically moves the "ResultType" logic
before choosing the loop (it thus is unable to capture some of the
stranger, probably non-existent corner cases).


On a different note: The ranking you are suggesting for python types
seems very much the same as what we have, with the exception that it
would not look at the value (I suppose what we would do instead is to
simply raise a casting error):

int8_arr + 87345  # output should always be int8, so crash on cast?

Which may be a viable approach. Although, signed/unsigned may be
tricky:

uint8_arr + py_int  # do we look at the py_int's sign?


- Sebastian


> 
> I would also add that unlike current numpy, my C-casting rules are
> associative (if all operands are numpy types, see note below), so it
> does not matter in which order you promote the types: (if)i and
> i(fi) give the same result. In current numpy this is not always the
> case:
> 
>     p = np.promote_types
>     p(p('u2',   'i1'), 'f4')    # ->  f8
>     p(  'u2', p('i1',  'f4'))   # ->  f4
> 
> (However, my casting rules are not associative if you include python
> scalars, e.g. np.float32(1) + 1.0 + np.int64(1). Maybe I should try
> to fix that...)
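The non-associativity in the `promote_types` example above can be checked directly:

```python
import numpy as np

p = np.promote_types
left = p(p('u2', 'i1'), 'f4')   # (uint16, int8) -> int32; with float32 -> float64
right = p('u2', p('i1', 'f4'))  # (int8, float32) -> float32; with uint16 -> float32
assert left == np.float64
assert right == np.float32
```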
> 
> Best,
> Allan
> 
> > I do not mind keeping the "simple" two (or even more) operand
> > "let's assume we have uniform types" logic around. For those it is
> > easy to find a "minimum type" even before actual loop lookup.
> > For the above example it would work well in any case, but it would
> > get complicated if, for example, the last integer is an unsigned
> > integer that happens to be small enough to also fit into a signed
> > integer.
> > 
> > That might give some wiggle room, possibly also to attach warnings
> > to it, or at least make things easier. But I would also like to
> > figure out whether we shouldn't try to move in any case. Sure,
> > attach a major version to it, but hopefully not a "big step" type.
> > 
> > One thing that I had not thought about is that if we create
> > FutureWarnings, we will need to provide a way to opt in to the
> > new/old behaviour.
> > The old behaviour can be achieved by just using the python types
> > (which probably is what most code that wants this behaviour does
> > already), but the behaviour is tricky. Users can pass `dtype`
> > explicitly, but that is a huge kludge...
> > Will think about whether there is a solution to that, because if
> > there is not, you are right, it has to be a "big step" kind of
> > release. Although, even then it would be nice to have warnings
> > that can be enabled to ease the transition!
> > 
> > - Sebastian
> > 
> > 
> > > See also:
> > > https://github.com/numpy/numpy/issues/12525
> > > 
> > > Allan
> > > 
> > > 
> > > On 6/5/19 4:41 PM, Sebastian Berg wrote:
> > > > Hi all,
> > > > 
> > > > TL;DR:
> > > > 
> > > > Value based promotion seems complex both for users and ufunc-
> > > > dispatching/promotion logic. Is there any way we can move
> > > > forward here, and if we do, could we just risk breaking some
> > > > possible (maybe non-existent) corner cases early to get on our
> > > > way?
> > > > 
> > > > -----------
> > > > 
> > > > Currently when you write code such as:
> > > > 
> > > > arr = np.array([1, 43, 23], dtype=np.uint16)
> > > > res = arr + 1
> > > > 
> > > > Numpy uses fairly sophisticated logic to decide that `1` can be
> > > > represented as a uint16, and thus for all unary functions (and
> > > > most
> > > > others as well), the output will have a `res.dtype` of uint16.
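The uint16 example runs as described (in this particular case the python int does not upgrade the result dtype, which holds across numpy versions):

```python
import numpy as np

arr = np.array([1, 43, 23], dtype=np.uint16)
res = arr + 1
# The python scalar 1 is absorbed into the array's dtype.
assert res.dtype == np.uint16
```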
> > > > 
> > > > Similar logic also exists for floating point types, where a
> > > > lower
> > > > precision floating point can be used:
> > > > 
> > > > arr = np.array([1, 43, 23], dtype=np.float32)
> > > > (arr + np.float64(2.)).dtype  # will be float32
> > > > 
> > > > Currently, this value based logic is enforced by checking
> > > > whether the cast is possible: "4" can be cast to int8 or uint8.
> > > > So the first call above will at some point check if "uint16 +
> > > > uint16 -> uint16" is a valid operation, find that it is, and
> > > > thus stop searching. (There is the additional logic that when
> > > > both/all operands are scalars, it is not applied.)
> > > > 
> > > > Note that this is defined in terms of casting: "1" can safely
> > > > be cast to uint8, even though 1 may be typed as int64. This
> > > > logic thus affects all promotion rules as well (i.e. what the
> > > > output dtype should be).
> > > > 
> > > > 
> > > > There are 2 main discussion points/issues about it:
> > > > 
> > > > 1. Should value based casting/promotion logic exist at all?
> > > > 
> > > > Arguably an `np.int32(3)` has type information attached to it,
> > > > so why should we ignore it? It can also be tricky for users,
> > > > because a small change in values can change the result data
> > > > type.
> > > > Because 0-D arrays and scalars are too close inside numpy (you
> > > > will often not know which one you get), there is not much
> > > > option but to handle them identically. However, it seems pretty
> > > > odd that:
> > > >  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
> > > >  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
> > > > 
> > > > give a different result.
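The two expressions can be compared directly; which dtype each produces depends on the numpy version (older versions value-inspect the 0-D array, later ones do not), so the check below only bounds the possibilities:

```python
import numpy as np

a = (np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype
b = (np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype
# Depending on the numpy version, a may be int8 (value based) or
# int32 (dtype based); b is int32 either way.
assert b == np.int32
assert a in (np.dtype(np.int8), np.dtype(np.int32))
```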
> > > > 
> > > > This is a bit different for python scalars, which do not have a
> > > > type
> > > > attached already.
> > > > 
> > > > 
> > > > 2. Promotion and type resolution in Ufuncs:
> > > > 
> > > > What is currently bothering me is that the decision what the
> > > > output dtypes should be depends on the values in complicated
> > > > ways. It would be nice if we could decide which type signature
> > > > to use without actually looking at values (or at least only
> > > > very early on).
> > > > 
> > > > One reason here is caching and simplicity. I would like to be
> > > > able to cache which loop should be used for what input. Having
> > > > value based casting in there bloats up the problem.
> > > > Of course it currently works OK, but especially when user
> > > > dtypes come into play, caching would seem like a nice
> > > > optimization option.
> > > > 
> > > > Because `uint8(127)` can also be an `int8`, but `uint8(128)`
> > > > cannot, it is not as simple as finding the "minimal" dtype once
> > > > and working with that.
> > > > Of course Eric and I discussed this a bit before, and you could
> > > > create an internal "uint7" dtype which has the only purpose of
> > > > flagging that a cast to int8 is safe.
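`np.min_scalar_type` illustrates why a single "minimal" dtype is not enough: it must pick one dtype per value, even when 127 is safely castable to both int8 and uint8 (which is exactly what the hypothetical "uint7" would flag):

```python
import numpy as np

assert np.min_scalar_type(127) == np.uint8  # but int8 would also hold 127
assert np.min_scalar_type(128) == np.uint8  # int8 no longer works here
assert np.min_scalar_type(-1) == np.int8    # the sign forces a signed dtype
```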
> > > > 
> > > > I suppose it is possible I am barking up the wrong tree here,
> > > > and this caching/predictability is not vital (or can be solved
> > > > easily with such an internal dtype, although I am not sure that
> > > > seems elegant).
> > > > 
> > > > 
> > > > Possible options to move forward
> > > > --------------------------------
> > > > 
> > > > I still have to see a bit how tricky things are, but there are
> > > > a few possible options. I would like to move the scalar logic
> > > > to the beginning of ufunc calls:
> > > >   * The uint7 idea would be one solution.
> > > >   * Simply implement something that works for numpy and all
> > > >     except strange external ufuncs (I can only think of numba
> > > >     as a plausible candidate for creating such).
> > > > 
> > > > My current plan is to see where the second thing leaves me.
> > > > 
> > > > We should also see if we cannot move the whole thing forward,
> > > > in which case the main decision would be where to move it to.
> > > > My current opinion is that when a type clearly has a dtype
> > > > associated with it, we should always use that dtype in the
> > > > future. This mostly means that numpy dtypes such as `np.int64`
> > > > will always be treated like an int64, and never like a `uint8`
> > > > just because they happen to be castable to that.
> > > > 
> > > > For values without a dtype attached (read: python integers and
> > > > floats), I see three options, from more complex to simpler:
> > > > 
> > > > 1. Keep the current logic in place as much as possible.
> > > > 2. Only support value based promotion for operators, e.g.:
> > > >    `arr + scalar` may do it, but `np.add(arr, scalar)` will
> > > >    not. The upside is that it limits the complexity to a much
> > > >    simpler problem; the downside is that the ufunc call and
> > > >    operator match less clearly.
> > > > 3. Just associate python float with float64 and python
> > > >    integers with long/int64, and force users to always type
> > > >    them explicitly if they need to.
> > > > 
> > > > The downside of 1. is that it doesn't help with simplifying the
> > > > current
> > > > situation all that much, because we still have the special
> > > > casting
> > > > around...
> > > > 
> > > > 
> > > > I have realized that this got much too long, so I hope it
> > > > makes sense. I will continue to dabble along on these things a
> > > > bit, so if nothing else maybe writing it helps me get a bit
> > > > clearer on things...
> > > > 
> > > > Best,
> > > > 
> > > > Sebastian
> > > > 
> > > > 
> > > > 
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > NumPy-Discussion at python.org
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > 
> > > 
> 
> 