Hi all,
Maybe to clarify this at least a little, here are some examples of
what currently happens and what I could imagine we could move to (all
in terms of output dtype).
float32_arr = np.ones(10, dtype=np.float32)
int8_arr = np.ones(10, dtype=np.int8)
uint8_arr = np.ones(10, dtype=np.uint8)
Current behaviour:
------------------
float32_arr + 12. # float32
float32_arr + 2**200 # float64 (because np.float32(2**200) == np.inf)
int8_arr + 127 # int8
int8_arr + 128 # int16
int8_arr + 2**20 # int32
uint8_arr + -1 # int16
# But 0d arrays are treated like scalars; only arrays that are not 0d
# use their dtype:
int8_arr + np.array(1, dtype=np.int32) # int8
int8_arr + np.array([1], dtype=np.int32) # int32
# When the actual typing is given, this does not change:
float32_arr + np.float64(12.) # float32
float32_arr + np.array(12., dtype=np.float64) # float32
# Except for inexact types, or complex:
int8_arr + np.float16(3) # float16 (same as array behaviour)
# The exact same happens with all ufuncs:
np.add(float32_arr, 1) # float32
np.add(float32_arr, np.array(12., dtype=np.float64)) # float32
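(Side note: the same value based logic is visible without creating any
arrays, for example via `np.result_type` and `np.min_scalar_type`; the
results below are what current numpy gives me:)

np.result_type(np.int8, 127)   # int8   (127 fits into an int8)
np.result_type(np.int8, 128)   # int16  (128 does not)
np.result_type(np.uint8, -1)   # int16
np.min_scalar_type(128)        # uint8, the "minimal" dtype of the value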
Keeping Value based casting only for python types
-------------------------------------------------
In this case, most examples above stay unchanged, because they use
plain python integers or floats, such as 2, 127, 12., 3, ... without
any type information attached (unlike, say, `np.float64(12.)`).
These, for example, change:
float32_arr + np.float64(12.) # float64
float32_arr + np.array(12., dtype=np.float64) # float64
np.add(float32_arr, np.array(12., dtype=np.float64)) # float64
# so a typed scalar such as `np.int32(1)` now behaves like an int32,
# just as np.uint64(10000) would behave like a uint64:
int8_arr + np.int32(1) # int32
int8_arr + np.int32(2**20) # int32
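(For these typed-scalar cases, `np.promote_types` already gives a
preview, since it only looks at the dtypes and never at the values:)

np.promote_types(np.float32, np.float64)   # float64
np.promote_types(np.int8, np.int32)        # int32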
Remove Value based casting completely
-------------------------------------
We could simply abolish it completely; a python `1` would then always
behave the same as `np.int_(1)`. The downside of this is that:
int8_arr + 1 # int64 (or int32)
suddenly uses much more memory. Alternatively, we could remove it from
ufuncs but not from operators:
int8_arr + 1 # int8 dtype
but:
np.add(int8_arr, 1) # int64
# same as:
np.add(int8_arr, np.array(1)) # int64
The main reason why I was wondering about that is that for operators
the logic seems fairly simple, but for general ufuncs it seems more
complex.
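One wrinkle with that split is that `int8_arr + 1` already goes through
`np.add` internally, so the two code paths are not as separate as they
look. A throwaway subclass shows the dispatch (just an illustration,
nothing here is a proposal):

import numpy as np

class LoggingArray(np.ndarray):
    # Report which ufunc ends up handling an operation, then defer to it.
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        print("dispatched to:", ufunc)
        inputs = tuple(np.asarray(x) if isinstance(x, LoggingArray) else x
                       for x in inputs)
        return getattr(ufunc, method)(*inputs, **kwargs)

arr = np.ones(3, dtype=np.int8).view(LoggingArray)
arr + 1    # prints "dispatched to: <ufunc 'add'>"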
Best,
Sebastian
On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and for the ufunc
> dispatching/promotion logic. Is there any way we can move forward
> here, and if we do, could we accept the risk of some possible (maybe
> non-existent) corner cases breaking early, to get on the way?
>
> -----------
>
> Currently when you write code such as:
>
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
>
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for this operation (and most
> others as well), the output will have a `res.dtype` of uint16.
>
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
>
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype # will be float32
>
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8, uint8. So the first call
> above will at some point check if "uint16 + uint16 -> uint16" is a
> valid operation, find that it is, and thus stop searching. (There is
> the additional logic, that when both/all operands are scalars, it is
> not applied).
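(To make that concrete, this is the check in action with today's
behaviour; `np.can_cast` inspects the value of a Python scalar:)

np.can_cast(127, np.int8)    # True  -> an int8 loop is acceptable
np.can_cast(128, np.int8)    # False
np.can_cast(128, np.uint8)   # True
np.can_cast(-1, np.uint8)    # False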
>
> Note that while this is defined in terms of casting ("1" can safely
> be cast to uint8 even though it may be typed as int64), the logic
> affects all promotion rules as well (i.e. what the output dtype
> should be).
>
>
> There are 2 main discussion points/issues about it:
>
> 1. Should value based casting/promotion logic exist at all?
>
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it? It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get), there is not much option but to
> handle them identically. However, it seems pretty odd that:
> * `np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)`
> * `np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)`
>
> give a different result.
>
> This is a bit different for python scalars, which do not have a type
> attached already.
>
>
> 2. Promotion and type resolution in Ufuncs:
>
> What is currently bothering me is that the decision of what the
> output dtypes should be depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
>
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
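(Schematically, and with completely made-up names, the caching I have
in mind is little more than a dict keyed on the input dtypes; value
based casting breaks exactly that key:)

import numpy as np

def slow_type_resolution(ufunc, dtypes):
    # Stand-in for the real search over the ufunc's registered loops.
    return np.result_type(*dtypes)

_loop_cache = {}

def pick_loop(ufunc, dtypes):
    # Only valid if the chosen loop depends on dtypes alone, not on values.
    key = (ufunc.__name__, dtypes)
    if key not in _loop_cache:
        _loop_cache[key] = slow_type_resolution(ufunc, dtypes)
    return _loop_cache[key]

pick_loop(np.add, (np.dtype(np.int8), np.dtype(np.int64)))  # int64, cached
# With value based casting, `int8_arr + 1` and `int8_arr + 2**20` share
# this key but need different loops, so dtypes alone cannot be the key.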
>
> Because `uint8(127)` can also be an `int8` while `uint8(128)` cannot,
> it is not as simple as finding the "minimal" dtype once and working
> with that.
> Of course Eric and I discussed this a bit before, and you could
> create an internal "uint7" dtype whose only purpose is to flag that
> a cast to int8 is safe.
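(As a rough illustration of what such a flag would capture; `safe_targets`
is made up, and today this information is effectively recomputed from the
value for every candidate loop:)

import numpy as np

def safe_targets(value):
    # Which fixed-size integer dtypes can hold this Python int?  An internal
    # "uint7"-style dtype would record the answer once, instead of checking
    # the value against every candidate.
    candidates = [np.int8, np.uint8, np.int16, np.uint16, np.int32]
    return [np.dtype(dt).name for dt in candidates if np.can_cast(value, dt)]

safe_targets(127)   # ['int8', 'uint8', 'int16', 'uint16', 'int32']
safe_targets(128)   # ['uint8', 'int16', 'uint16', 'int32']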
>
> I suppose it is possible I am barking up the wrong tree here, and
> this
> caching/predictability is not vital (or can be solved with such an
> internal dtype easily, although I am not sure it seems elegant).
>
>
> Possible options to move forward
> --------------------------------
>
> I still have to see a bit how tricky things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
> * The uint7 idea would be one solution
> * Simply implement something that works for numpy and all except
> strange external ufuncs (I can only think of numba as a plausible
> candidate for creating such).
>
> My current plan is to see where the second thing leaves me.
>
> We also should see if we cannot move the whole thing forward, in
> which case the main decision would have to be: forward to where? My
> current opinion is that when a type clearly has a dtype associated
> with it, we
> should always use that dtype in the future. This mostly means that
> numpy dtypes such as `np.int64` will always be treated like an int64,
> and never like a `uint8` because they happen to be castable to that.
>
> For values without a dtype attached (read python integers, floats), I
> see three options, from more complex to simpler:
>
> 1. Keep the current logic in place as much as possible
> 2. Only support value based promotion for operators, e.g.:
> `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
> The upside is that it limits the complexity to a much simpler
> problem, the downside is that the ufunc call and operator match
> less clearly.
> 3. Just associate python float with float64 and python integers with
> long/int64 and force users to always type them explicitly if they
> need to.
>
> The downside of 1. is that it doesn't help with simplifying the
> current
> situation all that much, because we still have the special casting
> around...
>
>
> I have realized that this got much too long, so I hope it makes
> sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...
>
> Best,
>
> Sebastian
>
>
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion