[Numpy-discussion] The status of DType Refactor

24 May 2021

Hi all,

I thought I would give a brief update on where we are with new DTypes.
Partially for Matti who is braving the brunt of the review, but also
for anyone else interested.  Please don't hesitate to ask for
clarifications, any questions, or to schedule a meeting to discuss!

Recap
Over the past year, most of the "big picture" changes have been merged
into NumPy, a good chunk of them already part of 1.20:
* dtype instances are now instances of np.dtype subclasses. I usually
write DType for those classes. But DTypeType is also a good name :).
* Array coercion using np.array(...) was completely rewritten, which
was necessary to allow new user DTypes.
* Introduced the ArrayMethod concept to unify casting and ufuncs as
much as possible (NEP 42/43):
  * Casting was first fixed up to support error returns.
  * The "can-cast" logic was rewritten in terms of ArrayMethod (i.e.
  casting safety checks are integrated into ArrayMethod).
  * Casting was largely reorganized around the ArrayMethod concept,
  including the casting safety.
* Promotion was implemented and later integrated everywhere, e.g. for
np.result_type(...).
* A larger refactor of the UFunc code and a few smaller PRs set the
stage for the ufunc work itself (see "Currently in Progress" below).
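
To make a few of the points above concrete, here is a small sketch using only long-standing public NumPy calls (np.dtype, np.can_cast, np.result_type). It assumes NumPy 1.20 or later and only illustrates the user-visible behaviour; it says nothing about how the ArrayMethod machinery is implemented internally.

```python
import numpy as np

# Each dtype instance has a class of its own, which subclasses np.dtype.
f8 = np.dtype("float64")
Float64DType = type(f8)
print(issubclass(Float64DType, np.dtype))  # True
print(isinstance(f8, np.dtype))            # True

# Parametric dtypes: both string dtypes share one class, but the
# instances differ because they carry a length parameter.
s3, s5 = np.dtype("S3"), np.dtype("S5")
print(type(s3) is type(s5))  # True
print(s3 == s5)              # False

# Casting safety checks.
print(np.can_cast(np.float32, np.float64, casting="safe"))       # True
print(np.can_cast(np.float64, np.float32, casting="safe"))       # False
print(np.can_cast(np.float64, np.float32, casting="same_kind"))  # True

# Promotion, including promotion between two parametric instances.
print(np.result_type(np.int8, np.float32) == np.dtype("float32"))  # True
print(np.result_type(s3, s5) == np.dtype("S5"))                    # True
```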

With the exception of universal functions, the above list covers all
major areas of NumPy that need to change. It also implements many of
the things that new user DTypes will need and currently cannot do.
Previously, these were either unavailable or limited in various ways,
especially when it comes to parametric DTypes such as units or strings.

Currently in Progress
The main remaining points are the universal functions. A majority of
NumPy features are organized as universal functions, and universal
functions inherently did not support parametric user-defined DTypes,
so they need a major change. This change is proposed in NEP 43
(although that will need some smaller updates).

The work on implementing it is mostly settling in the following PR and
the following branch (I hope these will move in very soon):
* PR 18905: Implements the new promotion and dispatching, and uses
them for most ufuncs.
* My development branch extends this to the reductions.

In parallel, the new DType API is only useful for users once it is
exposed. I have a branch to experiment with that:
* The experimental DType API exposure branch.
* And a repository with (currently Cython) examples using it. This
currently includes a very simplistic Units DType and ufuncs for
strings (previously difficult or not really possible).

The exact way to write a new DType probably still needs an alternative.
But note that this should largely be limited to the boilerplate code.

Future
The main step still remaining is figuring out exactly how best to
expose the DType API (ABI compatibility is the major concern) and, to
close things up, finishing NEP 43 (or most of it).

After that, there are still some things that need to be done (although
this list is unlikely to be exhaustive):
* The way users should define new DTypes has to be decided (this seems
tricky, unfortunately).
* Some functionality is defined in the "old style" API that should be
removed/discouraged. This includes things like sorting functions.
(The old way could be allowed for a transition period.) To be
specific, these are the ((PyArray_Descr *)descr)->f->funcs.
* Some small parts of the new API are missing right now. E.g.
ensure_nbo() in the current NumPy code has to use ensure_canonical()
as defined by NEP 42. Similarly, some parts will need tweaking.
* Parts of the API should be public, but it would also be nice to
clean them up before doing so. An example of this is get_loop() for
ufuncs. For most use-cases, this is probably not too important, but
the API is a bit awkward currently. (It would be possible to accept
the awkward API and replace it in the future with a new get_loop(),
deprecating the old one slowly.)
* There should be some new API for "reference counting" (more
generally, any item with memory management). This means cleaning up
the split between the current transfer to NULL and PyArray_XDECREF;
that is, we should unify them as much as possible (probably by using
the transfer to NULL path) and then also expose that to custom DTypes.
* Some utility functionality is missing at this time. For example, a
way for a Unit DType to fall back to the normal math implemented by
NumPy (after figuring out the unit part).
* A Python API is not on my explicit roadmap right now (although
probably not hard).
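
The "figure out the unit part, then fall back to normal math" idea from the list above can be sketched in pure Python. This is only an analogy: ToyUnit, resolve_multiply_unit and multiply_with_units are hypothetical names made up for illustration, and none of this is the C-level DType API described in NEP 42/43.

```python
import numpy as np


class ToyUnit:
    """Hypothetical stand-in for one instance of a parametric Unit DType."""

    def __init__(self, powers):
        # Map of base unit -> exponent, e.g. {"m": 1, "s": -1} for m/s.
        # Zero exponents are dropped so equal units compare equal.
        self.powers = {k: v for k, v in powers.items() if v != 0}

    def __eq__(self, other):
        return isinstance(other, ToyUnit) and self.powers == other.powers

    def __repr__(self):
        return f"ToyUnit({self.powers})"


def resolve_multiply_unit(a, b):
    """The unit part: multiplying quantities adds the unit exponents."""
    keys = set(a.powers) | set(b.powers)
    return ToyUnit({k: a.powers.get(k, 0) + b.powers.get(k, 0) for k in keys})


def multiply_with_units(x, x_unit, y, y_unit):
    """Resolve the output unit, then reuse NumPy's ordinary multiply."""
    out_unit = resolve_multiply_unit(x_unit, y_unit)
    # The fallback: the numeric work is plain float math done by NumPy.
    return np.multiply(x, y), out_unit


speed = ToyUnit({"m": 1, "s": -1})
time = ToyUnit({"s": 1})
values, unit = multiply_with_units(
    np.array([2.0, 3.0]), speed, np.array([4.0, 5.0]), time
)
print(values.tolist())            # [8.0, 15.0]
print(unit == ToyUnit({"m": 1}))  # True
```

In this toy version the unit lives next to the array rather than inside its dtype; the point of the missing utility is to allow the same split of responsibilities inside a real user DType.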

But most importantly, whatever comes up when potential users start
exploring the API, hopefully soon!

Otherwise, there are a couple of related improvements that I think
would make sense, such as storing the actual power-of-two alignment in
the array flags (they are getting a bit cramped if we assume int can be
16 bits, though). Also, the discussion about removing value-based
casting/promotion is one that would help with DTypes, and pushing it
forward probably makes sense as soon as the items that are "currently
in progress" are largely settled and the next NumPy version is
released.
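
As a rough illustration of the alignment idea, the sketch below computes the power-of-two alignment of an address as a small exponent. The function name and the cap are hypothetical choices for this example, not anything NumPy defines; the point is only that an exponent in the range 0..15 fits in 4 bits, which is why the number of free flag bits matters.

```python
def alignment_exponent(address, max_exponent=15):
    """Return k such that 2**k is the largest power of two dividing address.

    The result is capped at max_exponent so that it fits a fixed number
    of bits (4 bits for the default cap of 15). Address 0 is divisible
    by every power of two, so it simply gets the cap.
    """
    if address == 0:
        return max_exponent
    lowest_set_bit = address & -address  # isolates the lowest set bit
    return min(lowest_set_bit.bit_length() - 1, max_exponent)


print(alignment_exponent(64))  # 6, i.e. 64-byte aligned
print(alignment_exponent(24))  # 3, i.e. only 8-byte aligned
print(alignment_exponent(1))   # 0, i.e. no alignment guarantee
```

For a real array, the effective alignment would also depend on the strides, not just the data pointer; that part is left out here.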

Cheers,

Sebastian

Sebastian Berg