Current state of the DType refactor

Hi all,

with the latest changes, NumPy has a completely revised infrastructure
for:

* Creating a new array from Python objects
* Organizing casting (to align almost fully with ufuncs)
* Organizing the inner loops and dispatching of ufuncs
* Creating and "promoting" DTypes

This means that the core functionality of the DType NEPs is implemented:

* https://numpy.org/neps/nep-0041-improved-dtype-support.html
* https://numpy.org/neps/nep-0042-new-dtypes.html
* https://numpy.org/neps/nep-0043-extensible-ufuncs.html

This is an important milestone and it allows user-defined DTypes (and
ufuncs) with functionality beyond what was previously possible.
E.g. see the examples in:

    https://github.com/seberg/experimental_user_dtypes

Current user-implemented dtypes should already be able to achieve better
integration by moving towards the new system (experimentally!).
Many dtypes (categoricals, units, datetimes, etc.) that were previously
not possible can now be implemented as well!


Future Work
-----------

The work is by no means complete, but the character of the next steps
should change: from a focus on large and careful refactors towards
small changes that often unlock new features for user DTypes!

It also means that "feature requests" coming from users testing out the
new boundaries would be extremely helpful! The new API pushes
boundaries and we have to map out what this means :).

One larger remaining change is that we should slowly modernize NumPy's
own ufuncs to use the new API within NumPy itself, though this is not
necessarily urgent.

Here is a list of things of various importance and urgency:

* Moving "legacy" implementations of certain `PyArray_ArrayFuncs`
  (https://numpy.org/devdocs/reference/c-api/types-and-structures.html#c.PyArra...)
  slots to a new API on the DType. This can be done incrementally, and
  we can ask users to keep using the old API and transition to the new
  API when it is ready.
  Some examples are `nonzero`, sorting functions, or `copyswapn`, which
  are still used within NumPy. (NumPy should add new API for them, and
  "fill" the slots with generic implementations where possible.)

* Some user DTypes will want to use "references" (data being allocated
  for each element), which requires cleanup (e.g. `free`, `Py_DECREF`).
  There are two potential solutions:

  - Refactor the reference counting in NumPy (would be good in any case)
  - Add support for DTypes that use Python objects as their "storage"
    and reuse the current code.

* Physical units will want to conveniently re-use existing NumPy ufunc
  loops (math functions). The general infrastructure supports this,
  but API needs to be added to permit it and make it easy.

* There are many smaller things, for example:

  - Allow a user DType without a Python scalar (similar to astropy's
    Quantity), for which `arr1d[0]` returns a 0-D array.
  - A few tweaks to the current API (floating point errors and views)

* We could now "fix" `dtype="S"` to mean a string with undefined length
  and reject `np.dtype("S")` but allow `np.array([1, 2], dtype="S")`.

* I would also like to improve alignment tracking and handling (which
  is interesting for the public and private UFunc and Casting API).

Generally, the API needs to be finalized and exposed, since this is
currently only experimental with certain changes expected:

    https://github.com/numpy/numpy/blob/main/numpy/core/include/numpy/experiment...


Value-based casting/promotion
-----------------------------

One major difficulty (and cause of inconsistencies) is the use of
value-based casting:

    import numpy as np

    arr = np.array([1, 2, 3, 4], dtype=np.uint8)
    arr + 5       # result is `uint8`
    arr + (-255)  # result is `int16`

We do not wish to support this for user DTypes, but we may want to
support "weak promotion", where the user DType knows that the other
operand was a Python integer, float, or complex and can "downpromote"
it if desired.
In the above example, the first operation would still work (returning
`uint8`) and the second should raise an error or warn.

To some degree, this is an extension of existing logic and it is mostly
implemented. However, it is not yet integrated into the ufuncs for use
with user DTypes. It would be helpful to change NumPy's own behavior as
well, but that would likely require a major version change.
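
To make the difference between the two rules concrete, here is a small
pure-Python toy model. It is only a sketch under simplifying
assumptions: it covers integer types up to 64 bits, it does not call
NumPy, and every name in it (`min_int_dtype`, `promote`,
`value_based_result`, `weak_result`) is made up for illustration and is
not NumPy API. It uses the operand values 5 and -255 with a `uint8`
array dtype.

```python
# Toy model only: an integer "dtype" is a (signed, bits) tuple.
# None of these helpers exist in NumPy; they are hypothetical.
UINT8 = (False, 8)


def name(dtype):
    """Render a (signed, bits) tuple as a NumPy-style name."""
    signed, bits = dtype
    return f"{'int' if signed else 'uint'}{bits}"


def min_int_dtype(value):
    """Smallest toy integer dtype that can hold the Python int `value`."""
    for bits in (8, 16, 32, 64):
        if 0 <= value < 2 ** bits:
            return (False, bits)
        if -(2 ** (bits - 1)) <= value < 0:
            return (True, bits)
    raise OverflowError(f"{value} does not fit any toy integer dtype")


def promote(a, b):
    """Common dtype of two toy integer dtypes (128-bit cases ignored)."""
    (signed_a, bits_a), (signed_b, bits_b) = a, b
    if signed_a == signed_b:
        return (signed_a, max(bits_a, bits_b))
    # Mixed signedness: a signed type must be twice as wide as the
    # unsigned operand to represent all of its values.
    unsigned_bits = bits_b if signed_a else bits_a
    signed_bits = bits_a if signed_a else bits_b
    return (True, max(signed_bits, 2 * unsigned_bits))


def value_based_result(array_dtype, py_int):
    """Value-based rule: the *value* of the Python int picks its dtype."""
    return promote(array_dtype, min_int_dtype(py_int))


def weak_result(array_dtype, py_int):
    """Weak rule: the array dtype wins; the Python int must fit into it."""
    signed, bits = array_dtype
    low = -(2 ** (bits - 1)) if signed else 0
    high = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    if not low <= py_int <= high:
        raise OverflowError(
            f"Python integer {py_int} out of bounds for {name(array_dtype)}")
    return array_dtype


for value in (5, -255):
    print(value, "value-based:", name(value_based_result(UINT8, value)))
    try:
        print(value, "weak:", name(weak_result(UINT8, value)))
    except OverflowError as exc:
        print(value, "weak: error:", exc)
```

In this model the value-based rule silently changes the result type
depending on the value, while the weak rule keeps the array dtype and
reports the problem instead. A real user DType could of course choose
to warn rather than raise.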
participants (1)
-
Sebastian Berg