[Numpy-discussion] Moving forward with value based casting

Mon Jun 17 22:32:59 EDT 2019

On Tue, 2019-06-18 at 04:28 +0200, Hameer Abbasi wrote:
> On Wed, 2019-06-12 at 12:55 -0500, Sebastian Berg wrote:
> > On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> > > Hi all,
> > > 
> > > TL;DR:
> > > 
> > > Value based promotion seems complex both for users and for the
> > > ufunc-dispatching/promotion logic. Is there any way we can move
> > > forward here, and if we do, could we just risk breaking some
> > > possible (maybe non-existent) corner cases early to get on the way?
> > > 
> > 
> > Hi all,
> > 
> > just a note: I think I will go forward with trying to fill the hole
> > in the hierarchy with a non-existent uint7 dtype. That seemed like it
> > might be ugly, but if it does not escalate too much, it is probably
> > fairly straightforward. And it would allow us to simplify dispatching
> > without any logic change at all. After that we could still decide to
> > change the logic.
> 
> Hi Sebastian!
> 
> This seems like the right approach to me as well; I would just add one
> additional comment. Earlier on, you mentioned that a lot of "strange"
> dtypes will pop up when dealing with floats/ints, e.g. int15, int31,
> int63, int52 (for checking double-compat), int23 (single compat),
> int10 (half compat) and so on. The lookup table would get tricky to
> populate by hand. It might be worth using the logic I suggested to
> autogenerate it in some way, or to "determine" the temporary
> underspecified type, as Nathaniel proposed in his email to the list.
> That is, for each type we store the following numbers:
> 
> * flag (0 for numeric, 1 for non-numeric)
> * sign bits (0 for unsigned ints, 1 else)
> * integer/fraction bits (self-explanatory)
> * exponent bits (self-explanatory)
> * log-number of items (0 for real, 1 for complex, 2 for quaternion,
>   etc.) (I propose log because the Cayley-Dickson algebras [1] require
>   a power of two)
> 
> A type is safely castable to another if the target type meets or
> exceeds all of these numbers.
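The schema above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, field names, and the bit counts chosen for each type are hypothetical and not part of NumPy; the float entries use the explicit fraction-bit counts mentioned in the message (23, 52), and bfloat16 is assumed to have 7 fraction and 8 exponent bits.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class NumericSchema:
    """Hypothetical descriptor; field names are illustrative, not NumPy API."""
    non_numeric: int    # flag: 0 for numeric, 1 for non-numeric
    sign_bits: int      # 0 for unsigned ints, 1 otherwise
    value_bits: int     # integer bits (ints) or fraction bits (floats)
    exponent_bits: int  # 0 for integer types
    log2_items: int     # 0 real, 1 complex, 2 quaternion, ...


def can_cast_safely(src: NumericSchema, dst: NumericSchema) -> bool:
    """Safe if dst meets or exceeds src in every field."""
    return all(d >= s for s, d in zip(astuple(src), astuple(dst)))


# Assumed encodings for a few familiar types.
UINT8 = NumericSchema(0, 0, 8, 0, 0)
INT8 = NumericSchema(0, 1, 7, 0, 0)
INT16 = NumericSchema(0, 1, 15, 0, 0)
INT32 = NumericSchema(0, 1, 31, 0, 0)
BFLOAT16 = NumericSchema(0, 1, 7, 8, 0)
FLOAT32 = NumericSchema(0, 1, 23, 8, 0)
FLOAT64 = NumericSchema(0, 1, 52, 11, 0)
COMPLEX64 = NumericSchema(0, 1, 23, 8, 1)

print(can_cast_safely(UINT8, INT16))       # True
print(can_cast_safely(UINT8, INT8))        # False
print(can_cast_safely(INT32, FLOAT32))     # False
print(can_cast_safely(INT32, FLOAT64))     # True
print(can_cast_safely(BFLOAT16, FLOAT32))  # True
print(can_cast_safely(FLOAT32, BFLOAT16))  # False
```

With these assumed encodings the field-wise comparison gives the familiar answers for the integer and float cases shown, which is the property that would let a lookup table be generated rather than written by hand.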
> 
> This would give us a clean way of registering new numeric types, while
> also cleanly hooking into the type system and solving the casting
> scenario. Of course, I'm not proposing we generate the loops for or
> provide all these types ourselves, but simply that we allow people to
> define dtypes using such a schema. I do worry that we're
> special-casing numbers here, but it is "Num"Py, so I'm also not too
> worried.
> 
> This flexibility would, for example, allow us to easily define a
> bfloat16/bcomplex32 type with all the "can_cast" logic in place, even
> if people have to register their own casts or loops (and just to be
> clear, we error if they are not registered). It also makes it easy to
> define loops for int128 and so on if they come along.
> 
> The only open question left here is what to do with a case like
> int64 + uint64. What I propose is that we abandon purity for
> pragmatism here and tell ourselves that losing one sign bit is
> tolerable 90% of the time, and that going to floating-point is
> probably worse. It's more of a range-versus-accuracy question, and I
> would argue that people using integers expect exactness. Of course, I
> doubt anyone is actually relying on the fact that adding two integers
> produces floating-point results, and it has been the cause of at
> least one bug, which highlights that integers can be used in places
> where floats cannot. [0]
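The int64 + uint64 case can be checked directly. The snippet below is a small sketch that only uses `np.result_type` on dtypes (no values involved) and then shows why float64 loses exactness for large integers; the variable names are illustrative.

```python
import numpy as np

# Promotion by dtype only: no NumPy integer dtype covers both ranges,
# so the promoted type is a float.
promoted = np.result_type(np.int64, np.uint64)
print(promoted)  # float64

# float64 has a 53-bit significand, so not every 64-bit integer
# survives a round trip through it.
exact = 2**53 + 1
roundtrip = int(np.float64(exact))
print(exact == roundtrip)  # False
```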

P.S. Someone collected a list of issues where the automatic
float-conversion breaks things. It's old, but it does highlight the
importance of the issue: [0]

https://github.com/numpy/numpy/issues/12525#issuecomment-457727726

Hameer Abbasi

> 
> Hameer Abbasi
> 
> [0] https://github.com/numpy/numpy/issues/9982
> [1] https://en.wikipedia.org/wiki/Cayley%E2%80%93Dickson_construction
> 
> > Best,
> > 
> > Sebastian
> > 
> > 
> > > -----------
> > > 
> > > Currently when you write code such as:
> > > 
> > > arr = np.array([1, 43, 23], dtype=np.uint16)
> > > res = arr + 1
> > > 
> > > NumPy uses fairly sophisticated logic to decide that `1` can be
> > > represented as a uint16, and thus for all binary functions (and
> > > most others as well), the output will have a `res.dtype` of uint16.
> > > 
> > > Similar logic also exists for floating point types, where a lower
> > > precision floating point can be used:
> > > 
> > > arr = np.array([1, 43, 23], dtype=np.float32)
> > > (arr + np.float64(2.)).dtype  # will be float32
> > > 
> > > Currently, this value based logic is enforced by checking whether
> > > the cast is possible: e.g. "4" can be cast to int8 or uint8. So the
> > > first call above will at some point check whether
> > > "uint16 + uint16 -> uint16" is a valid operation, find that it is,
> > > and thus stop searching. (There is additional logic so that this
> > > is not applied when both/all operands are scalars.)
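The cast check described above can be approximated with public NumPy functions. This is a simplified sketch, not the actual ufunc type resolution: `fits_without_upcast` is a hypothetical helper, and it only combines `np.min_scalar_type` with a dtype-to-dtype `np.can_cast`.

```python
import numpy as np

# Smallest dtype that can hold each Python value.
print(np.min_scalar_type(1))      # uint8
print(np.min_scalar_type(-1))     # int8
print(np.min_scalar_type(70000))  # uint32


def fits_without_upcast(value, array_dtype):
    """Hypothetical helper: True if the value's minimal dtype casts
    safely to array_dtype, so the array dtype could be kept."""
    return np.can_cast(np.min_scalar_type(value), array_dtype)


print(fits_without_upcast(1, np.uint16))      # True
print(fits_without_upcast(70000, np.uint16))  # False
```

Note that this simplified check returns False for `fits_without_upcast(127, np.int8)`, because the minimal dtype of 127 is uint8 and uint8 does not cast safely to int8, even though the value 127 itself fits.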
> > > 
> > > Note that this is defined in terms of whether casting "1" to
> > > uint8 safely is possible, even though 1 may be typed as int64.
> > > This logic thus affects all promotion rules as well (i.e. what
> > > the output dtype should be).
> > > 
> > > 
> > > There are 2 main discussion points/issues about it:
> > > 
> > > 1. Should value based casting/promotion logic exist at all?
> > > 
> > > Arguably an `np.int32(3)` has type information attached to it, so
> > > why should we ignore it? It can also be tricky for users, because
> > > a small change in values can change the result data type.
> > > Because 0-D arrays and scalars are too close inside numpy (you
> > > will often not know which one you get), there is not much option
> > > but to handle them identically. However, it seems pretty odd that:
> > >  * `np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)`
> > >  * `np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)`
> > > 
> > > give a different result.
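The two expressions can be compared side by side. The snippet below is a sketch with illustrative variable names; it deliberately gives no expected output for the 0-D case, because that result depends on whether the installed NumPy applies value-based casting to 0-D arrays (under the rules described in this message it would be int8).

```python
import numpy as np

small = np.arange(10, dtype=np.int8)
zero_d = np.array(3, dtype=np.int32)   # 0-D array, treated like a scalar
one_d = np.array([3], dtype=np.int32)  # 1-D array, treated as an array

# Depends on whether values of 0-D operands are inspected.
print((zero_d + small).dtype)

# Array/array promotion uses dtypes only.
print((one_d + small).dtype)  # int32
```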
> > > 
> > > This is a bit different for python scalars, which do not have a
> > > type attached already.
> > > 
> > > 
> > > 2. Promotion and type resolution in Ufuncs:
> > > 
> > > What is currently bothering me is that the decision about what the
> > > output dtypes should be currently depends on the values in
> > > complicated ways. It would be nice if we could decide which type
> > > signature to use without actually looking at values (or at least
> > > only very early on).
> > > 
> > > One reason here is caching and simplicity. I would like to be able
> > > to cache which loop should be used for what input. Having value
> > > based casting in there bloats up the problem.
> > > Of course it currently works OK, but especially when user dtypes
> > > come into play, caching would seem like a nice optimization option.
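To illustrate why a dtype-only signature is convenient for caching, here is a minimal sketch. `resolve_result_dtype` is a hypothetical stand-in for loop selection, not NumPy's internal machinery; it works as a cache key only because the answer depends on the input dtypes and never on the values.

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=None)
def resolve_result_dtype(dtype_a, dtype_b):
    """Hypothetical stand-in for loop selection, keyed on dtypes only."""
    return np.promote_types(dtype_a, dtype_b)


u16, u8 = np.dtype(np.uint16), np.dtype(np.uint8)
print(resolve_result_dtype(u16, u8))           # uint16
print(resolve_result_dtype(u16, u8))           # uint16, served from cache
print(resolve_result_dtype.cache_info().hits)  # 1
```

If the result also depended on operand values, the cache key would have to include those values (or some summary of them), which is the bloat referred to above.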
> > > 
> > > Because `uint8(127)` can also be an `int8`, but `uint8(128)`
> > > cannot, it is not as simple as finding the "minimal" dtype once
> > > and working with that.
> > > Of course Eric and I discussed this a bit before, and you could
> > > create an internal "uint7" dtype which has the only purpose of
> > > flagging that a cast to int8 is safe.
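The "uint7" idea can be sketched as a labelling function. This is an assumption-laden illustration: `minimal_int_label` and the `uintN-1` label strings (uint7, uint15, ...) are hypothetical and do not exist in NumPy; the last two lines show the existing behaviour that motivates them.

```python
import numpy as np


def minimal_int_label(value):
    """Hypothetical: name the smallest integer type for a Python int.

    Non-negative values that also fit the signed type of the same width
    get a 'uint7'-style label to flag that both casts are safe.
    """
    for bits in (8, 16, 32, 64):
        if value < 0:
            if value >= -(2 ** (bits - 1)):
                return f"int{bits}"
        elif value < 2 ** (bits - 1):
            return f"uint{bits - 1}"
        elif value < 2 ** bits:
            return f"uint{bits}"
    raise OverflowError("value does not fit in 64 bits")


print(minimal_int_label(127))  # uint7
print(minimal_int_label(128))  # uint8
print(minimal_int_label(-1))   # int8
print(minimal_int_label(256))  # uint15

# The existing minimal dtype for 127 is uint8, which is not safely
# castable to int8 although the value itself fits.
print(np.min_scalar_type(127))         # uint8
print(np.can_cast(np.uint8, np.int8))  # False
```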
> > > 
> > > I suppose it is possible I am barking up the wrong tree here, and
> > > this caching/predictability is not vital (or can be solved easily
> > > with such an internal dtype, although I am not sure it seems
> > > elegant).
> > > 
> > > 
> > > Possible options to move forward
> > > --------------------------------
> > > 
> > > I still have to see a bit how tricky things are. But there are a
> > > few possible options. I would like to move the scalar logic to the
> > > beginning of ufunc calls:
> > >   * The uint7 idea would be one solution
> > >   * Simply implement something that works for numpy and all except
> > >     strange external ufuncs (I can only think of numba as a
> > >     plausible candidate for creating such).
> > > 
> > > My current plan is to see where the second thing leaves me.
> > > 
> > > We should also see whether we can move the whole thing forward, in
> > > which case the main decision would have to be forward to where. My
> > > opinion is currently that when a value clearly has a dtype
> > > associated with it, we should always use that dtype in the future.
> > > This mostly means that values of numpy dtypes such as `np.int64`
> > > will always be treated like an int64, and never like a `uint8`
> > > because they happen to be castable to that.
> > > 
> > > For values without a dtype attached (read: python integers and
> > > floats), I see three options, from more complex to simpler:
> > > 
> > > 1. Keep the current logic in place as much as possible
> > > 2. Only support value based promotion for operators, e.g.:
> > >    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
> > >    The upside is that it limits the complexity to a much simpler
> > >    problem, the downside is that the ufunc call and operator match
> > >    less clearly.
> > > 3. Just associate python float with float64 and python integers
> > >    with long/int64 and force users to always type them explicitly
> > >    if they need to.
> > > 
> > > The downside of 1. is that it doesn't help with simplifying the
> > > current situation all that much, because we still have the special
> > > casting around...
> > > 
> > > 
> > > I have realized that this got much too long, so I hope it makes
> > > sense. I will continue to dabble along on these things a bit, so
> > > if nothing else maybe writing it helps me to get a bit clearer on
> > > things...
> > > 
> > > Best,
> > > 
> > > Sebastian
> > > 
> > > 
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > 