Hi all,

Maybe to clarify this at least a little, here are some examples for what currently happens and what I could imagine we move to (all in terms of output dtype):

float32_arr = np.ones(10, dtype=np.float32)
int8_arr = np.ones(10, dtype=np.int8)
uint8_arr = np.ones(10, dtype=np.uint8)

Current behaviour:
------------------

float32_arr + 12.     # float32
float32_arr + 2**200  # float64 (because np.float32(2**200) == np.inf)

int8_arr + 127    # int8
int8_arr + 128    # int16
int8_arr + 2**20  # int32
uint8_arr + -1    # int16

# But only for arrays that are not 0d:
int8_arr + np.array(1, dtype=np.int32)    # int8
int8_arr + np.array([1], dtype=np.int32)  # int32

# When the actual typing is given, this does not change:
float32_arr + np.float64(12.)                  # float32
float32_arr + np.array(12., dtype=np.float64)  # float32

# Except for inexact types, or complex:
int8_arr + np.float16(3)  # float16 (same as array behaviour)

# The exact same happens with all ufuncs:
np.add(float32_arr, 1)                                # float32
np.add(float32_arr, np.array(12., dtype=np.float64))  # float32

Keeping Value based casting only for python types
-------------------------------------------------

In this case, most examples above stay unchanged, because they use plain python integers or floats, such as 2, 127, 12., 3, ... without any type information attached, such as `np.float64(12.)`.

These change, for example:

float32_arr + np.float64(12.)                         # float64
float32_arr + np.array(12., dtype=np.float64)         # float64
np.add(float32_arr, np.array(12., dtype=np.float64))  # float64

# so `np.int32(1)` would be treated by its dtype, the same as
# `np.uint64(10000)` is:
int8_arr + np.int32(1)      # int32
int8_arr + np.int32(2**20)  # int32

Remove Value based casting completely
-------------------------------------

We could simply abolish it completely; a python `1` would always behave the same as `np.int_(1)`. The downside of this is that:

int8_arr + 1  # int64 (or int32)

suddenly uses much more memory.

Or, we remove it from ufuncs, but not from operators:

int8_arr + 1  # int8 dtype

but:

np.add(int8_arr, 1)  # int64
# same as:
np.add(int8_arr, np.array(1))  # int64

The main reason why I was wondering about this is that for operators the logic seems fairly simple, but for general ufuncs it seems more complex.

Best,

Sebastian

On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
Hi all,
TL;DR:
Value based promotion seems complex both for users and for the ufunc-dispatching/promotion logic. Is there any way we can move forward here, and if we do, could we just risk some possible (maybe non-existing) corner cases breaking early to get on the way?
-----------
Currently when you write code such as:
arr = np.array([1, 43, 23], dtype=np.uint16)
res = arr + 1
Numpy uses fairly sophisticated logic to decide that `1` can be represented as a uint16, and thus for all binary functions (and most others as well), the output will have a `res.dtype` of uint16.
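The same decision is exposed through `np.result_type`, which applies the identical value based logic to scalars, so (on numpy as it is today) it is easy to inspect directly:

np.result_type(np.uint16, 1)   # dtype('uint16'): 1 fits in a uint16
np.result_type(np.uint16, -1)  # dtype('int32'): -1 needs a signed type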
Similar logic also exists for floating point types, where a lower precision floating point can be used:
arr = np.array([1, 43, 23], dtype=np.float32)
(arr + np.float64(2.)).dtype  # will be float32
Currently, this value based logic is enforced by checking whether the cast is possible: "4" can be cast to int8 and uint8. So the first call above will at some point check whether "uint16 + uint16 -> uint16" is a valid operation, find that it is, and thus stop searching. (There is additional logic so that when both/all operands are scalars, it is not applied.)
Note that this is defined in terms of casting: "4" can safely be cast to uint8, even though it may be typed as int64. This logic thus affects all promotion rules as well (i.e. what the output dtype should be).
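This is visible in `np.can_cast`, which (today) checks the value rather than the type when given a scalar:

np.can_cast(127, np.int8)   # True, the value fits
np.can_cast(128, np.int8)   # False, out of range for int8
np.can_cast(128, np.int16)  # True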
There are 2 main discussion points/issues about it:
1. Should value based casting/promotion logic exist at all?
Arguably, an `np.int32(3)` has type information attached to it, so why should we ignore it? It can also be tricky for users, because a small change in values can change the result data type.
Because 0-D arrays and scalars are so close inside numpy (you will often not know which one you get), there is not much option but to handle them identically. However, it seems pretty odd that:

  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=np.int8)
  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=np.int8)
give a different result.
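Concretely:

(np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype    # int8
(np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype  # int32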
This is a bit different for python scalars, which do not already have a type attached.
2. Promotion and type resolution in Ufuncs:
What is currently bothering me is that the decision of what the output dtypes should be depends on the values in complicated ways. It would be nice if we could decide which type signature to use without actually looking at values (or at least only very early on).
One reason here is caching and simplicity. I would like to be able to cache which loop should be used for what input. Having value based casting in there bloats up the problem. Of course it currently works OK, but especially when user dtypes come into play, caching would seem like a nice optimization option.
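To make the caching point concrete, here is a hypothetical sketch (all names are made up) of a loop cache keyed only on the input dtypes:

import numpy as np

_loop_cache = {}

def find_loop(ufunc, in_dtypes):
    # Resolve the loop for this ufunc/dtype combination only on a
    # cache miss (np.result_type stands in for the real, expensive
    # type-resolution search).
    key = (ufunc.__name__, in_dtypes)
    if key not in _loop_cache:
        _loop_cache[key] = np.result_type(*in_dtypes)
    return _loop_cache[key]

find_loop(np.add, (np.dtype(np.int8), np.dtype(np.int64)))  # int64

With value based casting this key is too coarse: `np.add` on (int8, int64) inputs needs different loops for the values 1 and 2**20, so the scalar's value would have to leak into the cache key.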
Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot, it is not as simple as finding the "minimal" dtype once and working with that. Of course Eric and I discussed this a bit before, and you could create an internal "uint7" dtype whose only purpose is to flag that a cast to int8 is safe.
I suppose it is possible I am barking up the wrong tree here, and this caching/predictability is not vital (or can be solved with such an internal dtype easily, although I am not sure how elegant that would be).
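To illustrate what such a flag would encode, a small sketch (the helper is hypothetical; the `np.can_cast` behaviour is today's):

import numpy as np

def fitting_dtypes(value):
    # Every integer dtype the value can safely be cast to -- the
    # information a "uint7"-style internal dtype would carry along.
    candidates = [np.int8, np.uint8, np.int16, np.uint16,
                  np.int32, np.uint32, np.int64, np.uint64]
    return [np.dtype(dt) for dt in candidates if np.can_cast(value, dt)]

fitting_dtypes(127)  # [int8, uint8, int16, ...] -- the "uint7" case
fitting_dtypes(128)  # [uint8, int16, ...] -- int8 is no longer safe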
Possible options to move forward
--------------------------------
I still have to see a bit how tricky things are. But there are a few possible options; I would like to move the scalar logic to the beginning of ufunc calls:

  * The uint7 idea would be one solution.
  * Simply implement something that works for numpy and all except strange external ufuncs (I can only think of numba as a plausible candidate for creating such).
My current plan is to see where the second thing leaves me.
We should also see whether we can move the whole thing forward; in that case the main decision would be where to move it to. My opinion is currently that when a type clearly has a dtype associated with it, we should always use that dtype in the future. This mostly means that numpy dtypes such as `np.int64` will always be treated like an int64, and never like a `uint8` just because they happen to be castable to that.
For values without a dtype attached (read: python integers and floats), I see three options, from more complex to simpler:
1. Keep the current logic in place as much as possible.
2. Only support value based promotion for operators, e.g. `arr + scalar` may do it, but `np.add(arr, scalar)` will not. The upside is that it limits the complexity to a much simpler problem; the downside is that the ufunc call and the operator then match less clearly.
3. Just associate python floats with float64 and python integers with long/int64, and force users to always type them explicitly if they need to (sketched below).
The downside of 1. is that it doesn't help with simplifying the current situation all that much, because we still have the special casting around...
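For completeness, option 3 is simple enough to sketch in a few lines (the helper is hypothetical, not a proposed API):

import numpy as np

def promote_option3(arr_dtype, scalar):
    # A python scalar always behaves like a fixed default dtype,
    # ignoring its value (bool must be checked before int).
    if isinstance(scalar, bool):
        scalar_dtype = np.dtype(np.bool_)
    elif isinstance(scalar, int):
        scalar_dtype = np.dtype(np.int64)   # i.e. long/int64
    elif isinstance(scalar, float):
        scalar_dtype = np.dtype(np.float64)
    else:
        scalar_dtype = np.asarray(scalar).dtype
    return np.promote_types(arr_dtype, scalar_dtype)

promote_option3(np.int8, 1)  # int64, not int8 as value based casting gives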
I have realized that this got much too long, so I hope it makes sense. I will continue to dabble along on these things a bit, so if nothing else maybe writing it helps me to get a bit clearer on things...
Best,
Sebastian