[Numpy-discussion] DType Roadmap/NEP Discussion

18 Sep 2019

Hi all,

I would like to make some progress towards a decision, since the broad
design has pretty much settled from my side. I am thinking about setting
up a meeting, and suggest Monday at 11am Pacific Time (I am open to
other times though).

My hope is to get everyone interested on board, so that we can make an
informed decision about the general direction very soon. So just reach
out, or discuss on the mailing list as well.

The current draft for an NEP is here:
https://hackmd.io/kxuh15QGSjueEKft5SaMug?both

There are some design goals that I would like to clear up. I would
prefer to avoid deep discussions of specific issues, since I think
the important decision right now is whether my general approach is
heading in the right direction.

It is not an easy topic, so my plan would be to briefly summarize the
design, then clarify any questions, and then discuss why the
alternatives were rejected. The most important thing is probably
gathering the concerns which need to be clarified before we can move
towards accepting the general design ideas.

The main point of the NEP draft is actually captured by the picture in
the linked document: DTypes are classes (such as Float64) and what is
attached to the array is an instance of that class, such as "<float64"
or ">float64". Additionally, we would have AbstractDType classes which
cannot be instantiated but define a type hierarchy.
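
As a rough illustration of the class/instance split described above,
here is a minimal pure-Python sketch. It is not the draft's
implementation: the `_abstract` flag, the `byteorder` argument and the
`Floating` class are hypothetical names chosen only for this example.

```python
class DTypeMeta(type):
    """Metaclass of every DType class (illustrative only)."""

    def __call__(cls, *args, **kwargs):
        # Abstract DTypes only define the hierarchy; refuse to instantiate.
        if cls.__dict__.get("_abstract", False):
            raise TypeError(f"{cls.__name__} is abstract and cannot be instantiated")
        return super().__call__(*args, **kwargs)


class DType(metaclass=DTypeMeta):
    _abstract = True


class Floating(DType):
    # Plays the role of an AbstractDType: part of the hierarchy only.
    _abstract = True


class Float64(Floating):
    _abstract = False

    def __init__(self, byteorder="<"):
        # Instance-level information lives on the instance, not the class.
        if byteorder not in ("<", ">"):
            raise ValueError("byteorder must be '<' or '>'")
        self.byteorder = byteorder

    def __repr__(self):
        return f"{self.byteorder}float64"


little = Float64("<")   # what would be attached to a little-endian array
big = Float64(">")      # same class, different instance
```

Both `little` and `big` are instances of `Float64`, and `Float64` is a
subclass of the abstract `Floating`, which cannot be instantiated itself.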

To list the main points:

* DTypes are classes (corresponding to the current type number)

* `arr.dtype` is an instance of its class, which allows it to store
  additional information such as a physical unit or the string length.

* Most things are defined in special dtype slots similar to Python's
  type and number slots. They will be hidden and can be set through
  an init function similar to `PyType_FromSpec` [1].

* Promotion is defined primarily on the DType classes

* Casting from one DType to another DType is defined by a new
  CastingImpl object (should become a special ufunc)
    - e.g. for strings, the CastingImpl is in charge of finding the
      correct string length

* The AbstractDType hierarchy will be used to decide the signature when
  calling UFuncs.
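
To make "promotion is defined primarily on the DType classes" a bit
more concrete, here is a small pure-Python sketch. The `common_dtype`
classmethod and the `promote_types` helper are hypothetical names for
this example, not the slot names from the draft; the point is only that
the lookup happens between classes, before any instance is involved.

```python
class DType:
    """Hypothetical base class for this sketch."""

    @classmethod
    def common_dtype(cls, other):
        # By default a DType class does not know how to promote.
        return NotImplemented


class Int64(DType):
    @classmethod
    def common_dtype(cls, other):
        if other is Int64:
            return Int64
        return NotImplemented


class Float64(DType):
    @classmethod
    def common_dtype(cls, other):
        if other in (Float64, Int64):
            return Float64
        return NotImplemented


def promote_types(a, b):
    """Ask each DType class in turn, similar to Python's binary operators."""
    for left, right in ((a, b), (b, a)):
        result = left.common_dtype(right)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common DType for {a.__name__} and {b.__name__}")
```

Here `promote_types(Int64, Float64)` resolves to `Float64` even though
`Int64` itself does not know about `Float64`, because the reversed
lookup succeeds.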

The main iffier points I can think of are:

* NumPy currently uses value based promotion in some cases, which
  requires special AbstractDTypes to describe (and some legacy
  paths). (They are used more like instances than typical classes.)

* Casting between flexible dtypes (such as strings) is a multi-step
  process to figure out the actual output dtype.
    - An example is: `np.can_cast("float64", "S3")` first finds
      that `Float64->String` is possible in principle and then
      asks the CastingImpl, which finds that `float64->S3` is not.

* We have to break ABI compatibility in a very minor, back-portable
  way. Further small incompatibilities are likely [2].

* Since it is a major redesign, a lot of code has to be added/touched,
  although it is possible to channel much of it back into the old
  machinery.

* There will be a largish amount of new API around the new DType type
  objects and also DTypeMeta type objects, which users can (although
  usually do not have to) subclass.
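
The multi-step casting check for flexible dtypes mentioned in the list
above could look roughly like the following pure-Python sketch. All
names (`Float64ToString`, `CASTING_IMPLS`, `is_safe`) are hypothetical,
and the required length of 32 characters is an assumption made for this
example rather than something the draft specifies.

```python
class Float64:
    """Stand-in for the Float64 DType class."""


class String:
    """Stand-in for the String DType class; instances carry a length."""

    def __init__(self, length):
        self.length = length


class Float64ToString:
    """Hypothetical CastingImpl for Float64 -> String."""

    REQUIRED_LENGTH = 32  # assumption for this sketch

    def is_safe(self, from_dtype, to_dtype):
        # Instance-level decision: is the requested string long enough?
        return to_dtype.length >= self.REQUIRED_LENGTH


# Class-level table: which (from, to) DType class pairs can cast at all.
CASTING_IMPLS = {(Float64, String): Float64ToString()}


def can_cast(from_dtype, to_dtype):
    # Step 1: look only at the DType classes.
    impl = CASTING_IMPLS.get((type(from_dtype), type(to_dtype)))
    if impl is None:
        return False
    # Step 2: let the CastingImpl inspect the actual instances.
    return impl.is_safe(from_dtype, to_dtype)
```

With these definitions, `can_cast(Float64(), String(3))` passes the
class-level step but is rejected by the CastingImpl, which mirrors the
`float64->S3` example.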

However, most other designs will have similar issues. Basically, I
currently really think this is "right", even if some details may end up
a bit tricky.

Best,

Sebastian

PS: The one thing outside the more general list above that I may want
to discuss is how acceptable a global dict/mapping for dtype discovery
during `np.array` coercion is (mapping python type -> dtype)...
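
To make the question concrete, such a global mapping could be as simple
as the sketch below. The registry name, `register_pytype` and
`discover_dtype` are hypothetical, and walking the MRO so that
subclasses fall back to a registered base type is just one possible
choice, not something the draft prescribes.

```python
class Int64:
    """Stand-in DType class."""


class Float64:
    """Stand-in DType class."""


# Global mapping: Python type -> DType class (the point under discussion).
_PYTYPE_TO_DTYPE = {}


def register_pytype(pytype, dtype_cls):
    """Associate a Python scalar type with a DType class."""
    _PYTYPE_TO_DTYPE[pytype] = dtype_cls


def discover_dtype(value):
    """Find the DType class for a Python object during coercion."""
    for klass in type(value).__mro__:
        if klass in _PYTYPE_TO_DTYPE:
            return _PYTYPE_TO_DTYPE[klass]
    raise TypeError(f"no DType registered for {type(value).__name__}")


register_pytype(int, Int64)
register_pytype(float, Float64)
```

One consequence of the MRO walk in this sketch is that `True` is
discovered as `Int64` unless `bool` is registered separately, which is
the kind of global side effect that makes the mapping worth discussing.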

[1] https://docs.python.org/3/c-api/type.html#c.PyType_FromSpec
[2] One possible issue may be "S0" which is normally used to denote
what in the new API would be the `String` DType class.

Sebastian Berg