Hi all,
I am pleased to propose NEP 41: First step towards a new Datatype System https://numpy.org/neps/nep-0041-improved-dtype-support.html
This NEP motivates the larger restructure of the datatype machinery in NumPy and defines a few fundamental design aspects. The long-term user impact will be to make user-defined datatypes easier to write and more feature-rich.
As this is a large restructure, the NEP represents only the first steps with some additional information in further NEPs being drafted [1] (this may be helpful to look at depending on the level of detail you are interested in). The NEP itself does not propose to add significant new public API. Instead it proposes to move forward with an incremental internal refactor and lays the foundation for this process.
The main user-facing change at this time is that datatypes will become classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific class). For most users, the main impact should be many new datatypes in the long run (see the user impact section). However, for those interested in API design within NumPy or with respect to implementing new datatypes, this and the following NEPs are important decisions in the future roadmap for NumPy.
The current full text is reproduced below, although the above link is probably a better way to read it.
Cheers
Sebastian
[1] NEP 40 gives some background information about the current system and its issues: https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac... and NEP 42 is a first draft of what the new API may look like:
https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3... (links to current rendered versions, check https://github.com/numpy/numpy/pull/15505 and https://github.com/numpy/numpy/pull/15507 for updates)
----------------------------------------------------------------------
=================================================
NEP 41 — First step towards a new Datatype System
=================================================
:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03
.. note::

    This NEP is part of a series of NEPs encompassing first information about the previous dtype implementation and issues with it in NEP 40. NEP 41 (this document) then provides an overview and generic design choices for the refactor. Further NEPs 42 and 43 go into the technical details of the datatype and universal function related internal and external API changes. In some cases it may be necessary to consult the other NEPs for a full picture of the desired changes and why these changes are necessary.
Abstract
--------
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
--------------------
.. seealso::

    The user impact section includes examples of what kind of new datatypes will be enabled by the proposed changes in the long run. It may thus help to read these sections out of order.
Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <_ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
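The string example above can be sketched in plain Python. This is only a toy model: ``FixedString`` and ``resolve_add`` are hypothetical names used for illustration and are not part of NumPy or of this NEP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedString:
    """Toy parametric datatype: a string with a fixed maximum length."""
    length: int


def resolve_add(left: FixedString, right: FixedString) -> FixedString:
    # Concatenation needs room for both operands, so the output
    # datatype is a new instance that matches neither input.
    return FixedString(left.length + right.length)


s4 = FixedString(4)
result = resolve_add(s4, s4)
print(result)  # FixedString(length=8)
```

The point of the sketch is that the output datatype has to be computed from the parameters of the inputs before any element is processed, which is the extra step the current API cannot express for user-defined datatypes.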
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
Currently there are many factors which limit the creation of new user-defined datatypes:
* Creating casting rules for parametric user-defined dtypes is either impossible or so complex that it has never been attempted.
* Type promotion, e.g. the operation deciding that adding float and integer values should return a float value, is very valuable for numeric datatypes but is limited in scope for user-defined and especially parametric datatypes.
* Much of the logic (e.g. promotion) is written in single functions instead of being split as methods on the datatype itself.
* In the current design datatypes cannot have methods that do not generalize to other datatypes. For example a unit datatype cannot have a ``.to_si()`` method to easily find the datatype which would represent the same values in SI units.
The pressing need to solve these issues has driven the scientific community to create work-arounds in multiple projects, implementing physical units as an array-like class instead of as a datatype, even though a datatype would generalize better across multiple array-likes (Dask, pandas, etc.). Pandas has already made a push in the same direction with its extension arrays [pandas_extension_arrays]_, and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
* Phase I: Restructure and extend the datatype infrastructure (This NEP 41)
* Phase II: Incrementally define or rework API (Detailed largely in NEPs 42/43)
* Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see backward compatibility section). However, a transition requiring large code adaptation is not anticipated and not within scope.
Specifically, this NEP makes the following design choices, which are discussed in more detail in the detailed description section:
1. Each datatype will be an instance of a subclass of ``np.dtype``, with most of the datatype-specific logic being implemented as special methods on the class. In the C-API, these correspond to specific slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself. The ``PyArray_ArrFuncs``, which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``), should instead be stored on the class, as is typically done in Python. In the future these may correspond to Python-side dunder methods. Storage information such as itemsize and byteorder can differ between different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance. This means that in the long run the current low-level access to dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
2. The current NumPy scalars will *not* change; they will not be instances of datatypes. This will also be true for new datatypes: scalars will not be instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to return ``True`` when appropriate).
Detailed technical decisions to follow in NEP 42.
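The split described in design choice 1 (logic on the class, storage information on the instance) can be illustrated with a pure-Python toy model. The classes ``DType``, ``Float64`` and ``Bytes`` below are stand-ins invented for this sketch; they do not reflect the actual NumPy implementation.

```python
class DType:
    """Toy stand-in for np.dtype; not the NumPy implementation."""

    itemsize = None  # fixed-size subclasses override this on the class

    def __init__(self, byteorder="="):
        # Storage details may differ between instances, so they live here.
        self.byteorder = byteorder

    @classmethod
    def nonzero(cls, value):
        # Datatype-specific logic lives on the class, not on the instance.
        raise NotImplementedError


class Float64(DType):
    itemsize = 8

    @classmethod
    def nonzero(cls, value):
        return value != 0.0


class Bytes(DType):
    def __init__(self, length, byteorder="|"):
        super().__init__(byteorder)
        self.itemsize = length  # "S3" and "S8" differ only in this value


f = Float64()
s3, s8 = Bytes(3), Bytes(8)

print(isinstance(f, DType))   # True: isinstance checks keep working
print(type(f) is DType)       # False: the type is now a subclass
print(type(s3) is type(s8))   # True: same class, different instances
```

In this model, code that compares ``type(f)`` against the base class breaks, while code that uses ``isinstance`` is unaffected, which mirrors the compatibility note made later in this NEP.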
Further, the public API will be designed in a way that is extensible in the future:
3. All new C-API functions provided to the user will hide implementation details as much as possible. The public API should be an identical, but limited, version of the C-API used for the internal NumPy datatypes.
The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:
4. To enable all of the desired functionality for new user-defined datatypes, the UFunc machinery will be changed to replace the current dispatching and type resolution system. The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
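One way to satisfy this principle is a binary dispatch in which each DType class is asked in turn and may decline. The sketch below is only an illustration under that assumption; ``common_with`` and ``common_dtype`` are hypothetical names, and the actual mechanism is left to NEP 42.

```python
class DType:
    @classmethod
    def common_with(cls, other):
        # By default a DType only knows how to promote with itself.
        return cls if other is cls else NotImplemented


class Int64(DType):
    pass


class Float64(DType):
    @classmethod
    def common_with(cls, other):
        # Float64 knows about Int64 and decides the result itself.
        if other in (Float64, Int64):
            return Float64
        return NotImplemented


class UserDType(DType):
    """A third-party datatype that the two classes above know nothing about."""


def common_dtype(a, b):
    # Ask each operand in turn; only an operand can name the result.
    for first, second in ((a, b), (b, a)):
        result = first.common_with(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common dtype for {a.__name__} and {b.__name__}")


print(common_dtype(Int64, Float64).__name__)  # Float64
```

Because the result can only come from one of the two operands, registering ``UserDType`` cannot change what ``common_dtype(Int64, Float64)`` returns.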
User Impact
-----------
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
* bfloat16, used in deep learning
* categorical types
* physical units (such as meters)
* datatypes for tracing/automatic differentiation
* high, fixed precision math
* specialized integer types such as int2, int24
* new, better datetime representations
* extending e.g. integer dtypes to have a sentinel NA value
* geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem by having such new datatypes, as well as consolidating implementation of such datatypes within NumPy to achieve better interoperability.
Examples
^^^^^^^^
The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP, and choices (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.
Simple Numerical Types
""""""""""""""""""""""

Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
Fixed, high precision math
""""""""""""""""""""""""""

Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> import mpmath as mp
    >>> print(mp.mp.dps)  # the current (default) precision
    15
NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as an ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True
Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.
Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)
And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
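The casting and promotion rules sketched in this subsection can be written down as a small toy model. ``MpfDType``, ``can_cast`` and ``common_type`` here are local, hypothetical functions (not the NumPy ones), and treating ``float64`` as equivalent to 16 decimal digits is the same assumption the text makes above.

```python
from dataclasses import dataclass

FLOAT64 = "float64"   # placeholder for np.float64 in this toy model
FLOAT64_DPS = 16      # assumption: decimal digits taken as equivalent to float64


@dataclass(frozen=True)
class MpfDType:
    """Toy parametric datatype carrying a decimal precision."""
    dps: int


def _dps(dtype):
    return FLOAT64_DPS if dtype == FLOAT64 else dtype.dps


def can_cast(from_dtype, to_dtype, casting="safe"):
    if casting == "safe":
        # Safe only if no precision is lost.
        return _dps(to_dtype) >= _dps(from_dtype)
    if casting == "same_kind":
        # All of these are floating point kinds.
        return True
    raise ValueError(f"unsupported casting mode: {casting!r}")


def common_type(a, b):
    # Promote to the higher of the two precisions.
    return MpfDType(max(_dps(a), _dps(b)))


print(can_cast(MpfDType(15), MpfDType(100)))    # True
print(can_cast(MpfDType(100), MpfDType(15)))    # False
print(common_type(MpfDType(5), FLOAT64))        # MpfDType(dps=16)
```

Both functions need access to the ``dps`` parameter of the instances, which is why casting and promotion for parametric datatypes cannot be described by a fixed table of type numbers.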
Categoricals
""""""""""""

Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = np.array(["eggs", "spam", "eggs", "toast"], dtype=cat)
could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = np.array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])
The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.
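The dictionary analogy can be made concrete with a small pure-Python sketch. The ``Categorical`` class below is a hypothetical toy, not a proposed API: it checks uniqueness on creation and stores values as integer codes so that equality reduces to integer comparison.

```python
class Categorical:
    """Toy categorical datatype with a fixed set of categories."""

    def __init__(self, categories):
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        # Dictionary from name to integer label, built once on creation.
        self._codes = {name: i for i, name in enumerate(self.categories)}

    def encode(self, values):
        # Store each value as a small integer instead of a string.
        return [self._codes[value] for value in values]

    def decode(self, codes):
        return [self.categories[code] for code in codes]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])

# Equality only has to compare the integer labels.
equal = [a == b for a, b in zip(breakfast, breakfast2)]
print(breakfast)  # [0, 1, 0, 2]
print(equal)      # [True, False, True, False]
```

If the categories were given in a meaningful order, the same integer labels would also support sorting and ordered comparison without looking at the names.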
Whether or not casting is defined from one categorical with fewer values to one with strictly more values is something that the Categorical datatype would need to decide. Both options should be available.
Unit on the Datatype
""""""""""""""""""""

There are different ways to define Units, depending on how the internal machinery would be organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``. The unit itself is part of the DType instance: ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.s)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())
The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)
There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
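The unit casting rule described earlier (a cast is valid only between scales of the same dimensionality) can be sketched as follows. The ``UNITS`` table and ``cast_values`` function are hypothetical and exist only for this illustration; a real unit datatype would carry far more structure.

```python
# Hypothetical table: unit name -> (dimension, scale relative to the SI unit).
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "s": ("time", 1.0),
}


def cast_values(values, from_unit, to_unit):
    """Rescale values between units of the same dimensionality."""
    from_dim, from_scale = UNITS[from_unit]
    to_dim, to_scale = UNITS[to_unit]
    if from_dim != to_dim:
        # Different dimensionality: no cast exists at all.
        raise TypeError(f"Cannot cast {from_unit} to {to_unit}.")
    factor = from_scale / to_scale
    return [value * factor for value in values]


print(cast_values([1.0, 2.5], "km", "m"))  # [1000.0, 2500.0]
```

Whether such a rescaling cast should count as "safe" is a decision for the unit datatype itself, since it can lose precision or overflow for extreme scale factors.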
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)
Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
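A toy model makes the distinction visible: the output unit depends on the operation, while an operation-independent result type does not exist for mismatched units. Units are represented here as dictionaries of base-unit exponents, and all function names are hypothetical.

```python
def _combine(left, right, sign):
    # Add (multiply) or subtract (divide) the exponents of each base unit.
    out = dict(left)
    for base, exponent in right.items():
        out[base] = out.get(base, 0) + sign * exponent
    # Drop base units that cancelled out.
    return {base: exp for base, exp in out.items() if exp != 0}


def multiply_units(left, right):
    return _combine(left, right, +1)


def divide_units(left, right):
    return _combine(left, right, -1)


def result_type(left, right):
    # Without knowing the operation, only identical units can be combined.
    if left != right:
        raise TypeError("no operation-independent result type for these units")
    return left


METER = {"m": 1}
SECOND = {"s": 1}

speed = divide_units(METER, SECOND)
print(speed)                          # {'m': 1, 's': -1}
print(multiply_units(speed, SECOND))  # {'m': 1}
```

In this sketch the decision about the output unit is made per operation and before any values are touched, which corresponds to the decision the specific ufunc loop has to make.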
Implementation
--------------

Plan to Approach the Full Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To address these issues in NumPy and enable new datatypes, multiple development stages are required:
* Phase I: Restructure and extend the datatype infrastructure (This NEP)

  * Organize Datatypes like normal Python classes [`PR 15508`]_

* Phase II: Incrementally define or rework API

  * Create a new and easily extensible API for defining new datatypes and related functionality. (NEP 42)

  * Incrementally define all necessary functionality through the new API (NEP 42):

    * Defining operations such as ``np.common_type``.
    * Allowing to define casting between datatypes.
    * Add functionality necessary to create a numpy array from Python scalars (i.e. ``np.array(...)``).
    * …

  * Restructure how universal functions work (NEP 43), in order to:

    * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be extended by user-defined datatypes such as Units.

    * allow efficient lookup for the correct implementation for user-defined datatypes.

    * enable reuse of existing code. Units should be able to use the normal math loops and add additional logic to determine output type.

* Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities:

  * Cleanup of legacy behaviour where it is considered buggy or undesirable.
  * Provide a path to define new datatypes from Python.
  * Assist the community in creating types such as Units or Categoricals
  * Allow strings to be used in functions such as ``np.equal`` or ``np.add``.
  * Remove legacy code paths within NumPy to improve long term maintainability
This document serves as a basis for Phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" Python type object for datatypes, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and the reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining a stable public API, which will initially be preliminary.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only concerns Phase I (shaded area). The rest encompasses Phase II, and its design choices are up for discussion. The figure nevertheless highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The changes required to NumPy are large and touch many areas of the code base, but many of these changes can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
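The relationship between ``DTypeMeta``, the DType classes, and dtype instances can be mimicked in Python with an ordinary metaclass. This is a sketch only: the real ``PyArray_DTypeMeta`` is to be defined in C, and the ``dtype_slots`` attribute and class keyword arguments used below are inventions of this example.

```python
class DTypeMeta(type):
    """Toy metaclass whose instances are DType classes."""

    def __new__(mcls, name, bases, namespace, **slots):
        cls = super().__new__(mcls, name, bases, namespace)
        # Collect slots from the bases, then add the ones given for this class.
        inherited = {}
        for base in bases:
            inherited.update(getattr(base, "dtype_slots", {}))
        cls.dtype_slots = {**inherited, **slots}
        return cls

    def __init__(cls, name, bases, namespace, **slots):
        # Swallow the slot keywords; type.__init__ has no use for them.
        super().__init__(name, bases, namespace)


class DType(metaclass=DTypeMeta):
    """Toy stand-in for np.dtype."""


class Float64(DType, itemsize=8, scalar_type=float):
    pass


print(type(Float64) is DTypeMeta)      # True: the class is a DTypeMeta instance
print(issubclass(Float64, DType))      # True
print(Float64.dtype_slots["itemsize"])  # 8
```

The metaclass is what gives each DType class a place to hold per-class data, which is the role the custom C slots play in the proposal.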
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before settling the full internal API:
1. New machinery for array coercion, with the goal of enabling user DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots into DType method slots.
At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
----------------------

While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:
* **Python API**:

  * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while right now ``type(np.dtype("f8")) is np.dtype``. Code should use ``isinstance`` checks, and in very rare cases may have to be adapted to use them.

* **C-API**:

  * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which uses ``type(dtype) is np.dtype``. When compiling against an old NumPy version, the macro may have to be replaced with the corresponding ``PyObject_IsInstance`` call. (If this is a problem, we could backport fixing the macro.)

  * The UFunc machinery changes will break *limited* parts of the current implementation. Replacing e.g. the default ``TypeResolver`` is expected to remain supported for a time, although optimized masked inner loop iteration (which is not even used *within* NumPy) will no longer be supported.

  * All functions currently defined on the dtypes, such as ``PyArray_Descr->f->nonzero``, will be defined and accessed differently. This means that in the long run low-level access code will have to be changed to use the new API. Such changes are expected to be necessary in very few projects.

* **dtype implementors (C-API)**:

  * The array which is currently provided to some functions (such as cast functions) will no longer be provided. For example ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array object with only some fields (mainly the dtype) being valid. At least in some code paths, a similar mechanism is already used.

  * The ``scalarkind`` slot and registration of scalar casting will be removed/ignored without replacement. It currently allows partial value-based casting. The ``PyArray_ScalarKind`` function will continue to work for builtin types, but will not be used internally and will be deprecated.

  * Currently user dtypes are defined as instances of ``np.dtype``. The creation works by the user providing a prototype instance. NumPy will need to modify at least the type during registration. This has no effect for either ``rational`` or ``quaternion``, and mutation of the structure seems unlikely after registration.
Since there is a fairly large API surface concerning datatypes, further changes or the limitation of certain functions to currently existing datatypes are likely to occur. For example functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
--------------------
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current NumPy datatypes are not full-scale Python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime``, can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).

The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value, and this may be achieved by allowing the creation of *abstract* datatypes. An example for an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
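The idea of abstract datatypes can be illustrated with Python's own ``abc`` module. The class names below (``DType``, ``Floating``, ``Float32``, ``Float64``) are placeholders chosen for this sketch; how NumPy would actually mark a DType as abstract is not specified here.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Toy stand-in for np.dtype."""

    @property
    @abstractmethod
    def itemsize(self):
        """Concrete datatypes must define their storage size."""


class Floating(DType):
    """Abstract: represents any floating point datatype (cf. np.floating)."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


dt = Float64()
print(isinstance(dt, Floating))  # True: hierarchy check without dtype.kind
print(isinstance(dt, Float32))   # False

try:
    Floating()  # abstract datatypes cannot be instantiated
except TypeError as exc:
    print("TypeError:", exc)
```

The abstract class provides the hierarchy for ``isinstance`` checks while avoiding inheritance between concrete, storable datatypes.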
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that rather than reducing the complexity, i.e. by merging the two distinct type hierarchies, making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely solution to hide the implementation from the user, and thus make it extensible in the future, is to model the API after Python's stable API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    }

    typedef struct{
      PyTypeObject *typeobj;  /* type of python scalar */
      ...;
      PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
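The spec-based pattern can be mimicked in Python to show why it stays extensible: the user hands over pairs of slot identifier and function rather than filling in a fixed structure. Everything in this sketch (``init_dtype_from_spec``, ``KNOWN_SLOTS``, the slot names) is hypothetical and only loosely mirrors the C outline above.

```python
from fractions import Fraction


class DType:
    """Toy base class standing in for np.dtype."""


# Hypothetical slot identifiers; new ones can be added later without
# invalidating specs written against the older set.
KNOWN_SLOTS = {"nonzero", "getitem", "setitem"}


def init_dtype_from_spec(name, spec):
    """Build a new DType class from a spec dictionary."""
    unknown = set(spec["slots"]) - KNOWN_SLOTS
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    namespace = {"scalar_type": spec["scalar_type"]}
    for slot_id, function in spec["slots"].items():
        # Each slot becomes a method stored on the class.
        namespace[slot_id] = staticmethod(function)
    return type(name, (DType,), namespace)


rational_spec = {
    "scalar_type": Fraction,
    "slots": {"nonzero": lambda value: value != 0},
}
RationalDType = init_dtype_from_spec("RationalDType", rational_spec)

print(RationalDType.nonzero(Fraction(1, 2)))  # True
print(RationalDType.nonzero(Fraction(0)))     # False
```

Because the user never sees the layout of the resulting class, the library remains free to add slots or change internal storage without breaking existing datatype definitions.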
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
* The current UFunc type resolution must be adapted to allow better control for user-defined dtypes as well as resolve current inconsistencies.
* The inner-loop used in UFuncs must be expanded to include a return value. Further, error reporting must be improved, and passing in dtype-specific information enabled. This requires the modification of the inner-loop function signature and addition of new hooks called before and after the inner-loop is used.
An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
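The two goals above (an inner loop with a return value, and reuse of existing loops by new datatypes) can be sketched together in Python. ``float_divide_loop`` and ``unit_divide`` are hypothetical, and treating division by zero as an error is a simplification made for this toy; it is not how NumPy's float loops behave.

```python
def float_divide_loop(left, right, out):
    """Stand-in for an existing float inner loop.

    Returns 0 on success and -1 on error, instead of returning nothing.
    """
    for i, (a, b) in enumerate(zip(left, right)):
        if b == 0.0:
            return -1  # toy error condition for this sketch
        out[i] = a / b
    return 0


def unit_divide(left, left_unit, right, right_unit):
    # Step 1: the unit datatype handles only the unit-related computation.
    out_unit = f"{left_unit}/{right_unit}"
    # Step 2: the numeric work falls back to the existing float loop.
    out = [0.0] * len(left)
    if float_divide_loop(left, right, out) != 0:
        raise ZeroDivisionError("inner loop reported an error")
    return out, out_unit


values, unit = unit_divide([1.0, 2.0, 3.0], "m", [2.0, 2.0, 2.0], "s")
print(values, unit)  # [0.5, 1.0, 1.5] m/s
```

The unit datatype adds a thin layer before the loop runs and reacts to the status code afterwards, so no arithmetic has to be reimplemented.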
Discussion
----------
See NEP 40 for a list of previous meetings and discussions.
References
----------
.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
---------
This document has been placed in the public domain.
Acknowledgments
---------------
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
Hi all,
in the spirit of trying to keep this moving, can I assume that the main reason for little discussion is that the actual changes proposed are not very far reaching as of now? Or is the reason that this is a fairly complex topic that you need more time to think about it? If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.
If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one-week notice in the next few days.
To summarize one more time, the main point is that:
type(np.dtype(np.float64))
will be `np.dtype[float64]`, a subclass of dtype, so that:
issubclass(np.dtype[float64], np.dtype)
is true. This means that we will have one class for every current type number: `dtype.num`. The implementation of these subclasses will be a C-written (extension) MetaClass; all details of this class are supposed to remain experimental and in flux at this time.
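To make the class layout concrete, here is a minimal pure-Python sketch of the idea. It does not use NumPy: ``DTypeMeta``, ``dtype``, ``make_dtype_class`` and the type numbers are hypothetical stand-ins, and the real metaclass would be written in C.

```python
class DTypeMeta(type):
    """Hypothetical stand-in for the C-level metaclass of all DType classes."""

    def __repr__(cls):
        return f"dtype[{cls.scalar_name}]"


class dtype(metaclass=DTypeMeta):
    """Hypothetical stand-in for np.dtype, the common base class."""
    scalar_name = "generic"
    num = None


def make_dtype_class(scalar_name, num):
    # One class per type number, created through the metaclass.
    return DTypeMeta(
        f"{scalar_name.capitalize()}DType",
        (dtype,),
        {"scalar_name": scalar_name, "num": num},
    )


# The type numbers below are illustrative only.
Float64 = make_dtype_class("float64", 12)
Int64 = make_dtype_class("int64", 7)

f8 = Float64()                      # plays the role of np.dtype(np.float64)
print(type(f8))                     # the float64-specific class
print(issubclass(Float64, dtype))   # the subclass relation described above
```

The instance ``f8`` still passes ``isinstance(f8, dtype)``; only ``type(f8)`` changes from the base class to the specific subclass.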
Cheers
Sebastian
On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
=================================================
NEP 41 — First step towards a new Datatype System
=================================================

:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03
.. note::

    This NEP is part of a series. NEP 40 first provides information about the previous dtype implementation and its issues. NEP 41 (this document) then provides an overview and generic design choices for the refactor. NEPs 42 and 43 go into the technical details of the datatype and universal function related internal and external API changes. In some cases it may be necessary to consult the other NEPs for a full picture of the desired changes and why these changes are necessary.
Abstract
--------
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
--------------------

.. seealso::

    The user impact section includes examples of what kind of new datatypes will be enabled by the proposed changes in the long run. It may thus help to read these sections out of order.

Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <_ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
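The extra step for parametric datatypes can be illustrated with a small pure-Python sketch. ``StringDType`` and ``resolve_add`` are hypothetical names used only for illustration; they are not NumPy API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDType:
    """Hypothetical parametric datatype: the length is an instance parameter."""
    length: int


def resolve_add(a, b):
    # Adding (concatenating) strings needs an output descriptor that is
    # computed from the inputs; it is not simply one of the input dtypes.
    return StringDType(a.length + b.length)


s4 = StringDType(4)
result = resolve_add(s4, s4)
print(result)   # the output length is the sum of the input lengths
```

A unit datatype would need the same kind of hook, computing the output unit (for example a speed) from the input units instead of a length.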
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
Currently there are many factors which limit the creation of new user-defined datatypes:
- Creating casting rules for parametric user-defined dtypes is either
  impossible or so complex that it has never been attempted.
- Type promotion, e.g. the operation deciding that adding float and
  integer values should return a float value, is very valuable for
  numeric datatypes but is limited in scope for user-defined and
  especially parametric datatypes.
- Much of the logic (e.g. promotion) is written in single functions
  instead of being split as methods on the datatype itself.
- In the current design datatypes cannot have methods that do not
  generalize to other datatypes. For example a unit datatype cannot
  have a ``.to_si()`` method to easily find the datatype which would
  represent the same values in SI units.
The large need to solve these issues has driven the scientific community to create work-arounds in multiple projects implementing physical units as an array-like class instead of a datatype, which would generalize better across multiple array-likes (Dask, pandas, etc.). Already, Pandas has made a push into the same direction with its extension arrays [pandas_extension_arrays]_ and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
- Phase I: Restructure and extend the datatype infrastructure (This NEP 41)
- Phase II: Incrementally define or rework API (Detailed largely in NEPs 42/43)
- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see backward compatibility section). However, a transition requiring large code adaptation is not anticipated and not within scope.

Specifically, this NEP makes the following design choices which are discussed in more detail in the detailed description section:

1. Each datatype will be an instance of a subclass of ``np.dtype``,
   with most of the datatype-specific logic being implemented as special methods on the class. In the C-API, these correspond to specific slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself. The ``PyArray_ArrFuncs`` which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``), should instead be stored on the class as typically done in Python. In the future these may correspond to Python-side dunder methods. Storage information such as itemsize and byteorder can differ between different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance. This means that in the long run the current low-level access to dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
2. The current NumPy scalars will *not* change, they will not be
   instances of datatypes. This will also be true for new datatypes, scalars will not be instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to return ``True`` when appropriate).
   Detailed technical decisions to follow in NEP 42.
Further, the public API will be designed in a way that is extensible in the future:
3. All new C-API functions provided to the user will hide
   implementation details as much as possible. The public API should be an identical, but limited, version of the C-API used for the internal NumPy datatypes.
The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:
4. To enable all of the desired functionality for new user-defined
   datatypes, the UFunc machinery will be changed to replace the current dispatching and type resolution system. The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
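One way to satisfy this principle is to let only the two involved datatypes decide the result, similar to how Python's binary operators return ``NotImplemented``. The sketch below is a pure-Python illustration under that assumption; ``common_with`` and the three classes are hypothetical and not part of this NEP.

```python
class Int64:
    @classmethod
    def common_with(cls, other):
        # Int64 only knows about itself.
        return cls if other is cls else NotImplemented


class Float64:
    @classmethod
    def common_with(cls, other):
        # Float64 knows that it can represent Int64 values.
        return cls if other in (cls, Int64) else NotImplemented


class Rational:
    """Stands in for a user-defined datatype added later."""

    @classmethod
    def common_with(cls, other):
        return cls if other in (cls, Int64) else NotImplemented


def common_dtype(a, b):
    # Ask a, then b. The result can only come from one of the two,
    # so an unrelated third datatype can never change the answer.
    for first, second in ((a, b), (b, a)):
        result = first.common_with(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common dtype for {a.__name__} and {b.__name__}")


print(common_dtype(Int64, Float64).__name__)
```

Registering ``Rational`` does not alter ``common_dtype(Int64, Float64)``, and ``common_dtype(Float64, Rational)`` fails because neither side knows the other.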
User Impact
-----------
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
- bfloat16, used in deep learning
- categorical types
- physical units (such as meters)
- datatypes for tracing/automatic differentiation
- high, fixed precision math
- specialized integer types such as int2, int24
- new, better datetime representations
- extending e.g. integer dtypes to have a sentinel NA value
- geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem by having such new datatypes, as well as consolidating implementation of such datatypes within NumPy to achieve better interoperability.
Examples
^^^^^^^^

The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP and choices (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.

Simple Numerical Types
""""""""""""""""""""""

Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
Fixed, high precision math
""""""""""""""""""""""""""
Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> from mpmath import mp
    >>> print(mp.dps)  # the current (default) precision
    15
NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as an ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True

Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.
Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)

And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
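The casting and promotion rules described above can be modelled in a few lines of plain Python. This is only a sketch of the logic: ``MpfDType``, ``can_cast_safely`` and ``common_dtype`` are hypothetical names, and the value of 16 decimal digits for ``float64`` is the assumption taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical parametric datatype carrying a decimal precision."""
    dps: int


FLOAT64 = "float64"   # placeholder for np.float64 in this sketch
FLOAT64_DPS = 16      # assumed decimal precision equivalent to float64


def as_mpf(dt):
    # Treat float64 as an mpf datatype of equivalent precision.
    return MpfDType(FLOAT64_DPS) if dt == FLOAT64 else dt


def can_cast_safely(from_dt, to_dt):
    # A cast is "safe" when no precision is lost.
    return as_mpf(from_dt).dps <= as_mpf(to_dt).dps


def common_dtype(a, b):
    # The common datatype is the one with the higher precision.
    a, b = as_mpf(a), as_mpf(b)
    return a if a.dps >= b.dps else b


print(can_cast_safely(MpfDType(15), MpfDType(100)))
print(common_dtype(MpfDType(5), FLOAT64))
```

The point of the sketch is that both decisions depend on the instance parameter ``dps``, which is why they have to be made by the datatype itself rather than by generic NumPy code.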
Categoricals
""""""""""""

Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)

could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])

The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.
Whether or not casting is defined from one categorical with less to one with strictly more values defined, is something that the Categorical datatype would need to decide. Both options should be available.
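The storage idea behind the breakfast example can be sketched in pure Python. The ``Categorical`` class below is a hypothetical illustration of integer-coded storage, not a proposed API.

```python
class Categorical:
    """Hypothetical categorical datatype with a fixed set of categories."""

    def __init__(self, categories):
        # Like dictionary keys, category names must be unique.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        self._codes = {name: code for code, name in enumerate(categories)}

    def encode(self, values):
        # Store small integer labels instead of the values themselves.
        return [self._codes[value] for value in values]

    def decode(self, codes):
        return [self.categories[code] for code in codes]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])

# Equality only needs to compare the integer labels.
equal = [a == b for a, b in zip(breakfast, breakfast2)]
print(equal)
```

Because both arrays share the same datatype instance, comparing codes is equivalent to comparing the original values.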
Unit on the Datatype
""""""""""""""""""""

There are different ways to define Units, depending on how the internal machinery would be organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``; the unit itself is part of the DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.seconds)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())

The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)
There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)
Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)

In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
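The difference between an operation-independent ``result_type`` and a per-operation decision can be sketched as follows. Units are modelled as sorted tuples of ``(name, exponent)`` pairs; ``make_unit``, ``result_type``, ``resolve_multiply`` and ``resolve_divide`` are hypothetical helpers for illustration only.

```python
def make_unit(**exponents):
    # Normalize to a hashable, order-independent form; drop zero exponents.
    return tuple(sorted((n, e) for n, e in exponents.items() if e != 0))


def result_type(a, b):
    # Without knowing the operation, only identical units can be combined.
    if a != b:
        raise TypeError("no common unit without knowing the operation")
    return a


def resolve_multiply(a, b):
    # The multiply loop decides its own output unit: exponents add up.
    total = dict(a)
    for name, exponent in b:
        total[name] = total.get(name, 0) + exponent
    return make_unit(**total)


def resolve_divide(a, b):
    # Division is multiplication with negated exponents.
    inverse = make_unit(**{name: -exponent for name, exponent in b})
    return resolve_multiply(a, inverse)


meters = make_unit(m=1)
seconds = make_unit(s=1)
print(resolve_multiply(meters, seconds))
print(resolve_divide(meters, seconds))
```

Here ``result_type(meters, seconds)`` raises, while each operation-specific resolver produces a well-defined output unit before any values are computed.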
Implementation
--------------

Plan to Approach the Full Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To address these issues in NumPy and enable new datatypes, multiple development stages are required:
- Phase I: Restructure and extend the datatype infrastructure (This NEP)

  - Organize Datatypes like normal Python classes [`PR 15508`]_

- Phase II: Incrementally define or rework API

  - Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)

  - Incrementally define all necessary functionality through the new
    API (NEP 42):

    - Defining operations such as ``np.common_type``.
    - Allowing to define casting between datatypes.
    - Add functionality necessary to create a numpy array from Python
      scalars (i.e. ``np.array(...)``).
    - …

  - Restructure how universal functions work (NEP 43), in order to:

    - make it possible to allow a `~numpy.ufunc` such as ``np.add``
      to be extended by user-defined datatypes such as Units.
    - allow efficient lookup for the correct implementation for
      user-defined datatypes.
    - enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine output
      type.

- Phase III: Growth of NumPy and Scientific Python Ecosystem
  capabilities:

  - Cleanup of legacy behaviour where it is considered buggy or
    undesirable.
  - Provide a path to define new datatypes from Python.
  - Assist the community in creating types such as Units or
    Categoricals
  - Allow strings to be used in functions such as ``np.equal`` or
    ``np.add``.
  - Remove legacy code paths within NumPy to improve long term
    maintainability
This document serves as a basis for phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" datatype Python type object, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining an, initially preliminary, stable public API.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only regards Phase I (shaded area), the rest encompasses Phase II and the design choices are up for discussion, however, it highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The required changes necessary to NumPy are large and touch many areas of the code base but many of these changes can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before settling the full internal API:

1. New machinery for array coercion, with the goal of enabling user
   DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
   into DType method slots.

At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
----------------------

While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:
**Python API**:
- ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
  while right now ``type(np.dtype("f8")) is np.dtype``. Code should use
  ``isinstance`` checks, and in very rare cases may have to be adapted
  to use it.
**C-API**:
- In old versions of NumPy ``PyArray_DescrCheck`` is a macro
  which uses ``type(dtype) is np.dtype``. When compiling against an old
  NumPy version, the macro may have to be replaced with the
  corresponding ``PyObject_IsInstance`` call. (If this is a problem, we
  could backport fixing the macro)
- The UFunc machinery changes will break *limited* parts of the
  current implementation. Replacing e.g. the default ``TypeResolver``
  is expected to remain supported for a time, although optimized masked
  inner loop iteration (which is not even used *within* NumPy) will no
  longer be supported.
- All functions currently defined on the dtypes, such as
  ``PyArray_Descr->f->nonzero``, will be defined and accessed
  differently. This means that in the long run low-level access code
  will have to be changed to use the new API. Such changes are expected
  to be necessary in very few projects.
**dtype implementors (C-API)**:
- The array which is currently provided to some functions (such as
  cast functions), will no longer be provided. For example
  ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn``
  may instead receive a dummy array object with only some fields
  (mainly the dtype) being valid. At least in some code paths, a
  similar mechanism is already used.
- The ``scalarkind`` slot and registration of scalar casting will
  be removed/ignored without replacement. It currently allows partial
  value-based casting. The ``PyArray_ScalarKind`` function will
  continue to work for builtin types, but will not be used internally
  and be deprecated.
- Currently user dtypes are defined as instances of ``np.dtype``.
  The creation works by the user providing a prototype instance. NumPy
  will need to modify at least the type during registration. This has
  no effect for either ``rational`` or ``quaternion`` and mutation of
  the structure seems unlikely after registration.

Since there is a fairly large API surface concerning datatypes, further changes or the limitation of certain functions to currently existing datatypes is likely to occur. For example functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
--------------------
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current NumPy datatypes are not full scale python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime`` can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).
The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value; this may be achieved by allowing the creation of *abstract* datatypes. An example for an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
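As a rough illustration of abstract datatypes, the following pure-Python sketch uses the standard ``abc`` module. The class names are hypothetical stand-ins; how NumPy would actually implement abstract DTypes is left to later NEPs.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Hypothetical stand-in for the np.dtype base class."""

    @property
    @abstractmethod
    def itemsize(self):
        """Size of one element in bytes."""


class Floating(DType):
    """Abstract datatype: groups floating point types, cannot be instantiated."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


def is_floating(dt):
    # The hierarchy allows a check that needs no kind or char codes.
    return isinstance(dt, Floating)


print(is_floating(Float64()))
```

Calling ``Floating()`` directly raises ``TypeError`` because ``itemsize`` is still abstract, which matches the idea that an abstract datatype describes a category rather than a concrete storage layout.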
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that, rather than reducing the complexity by merging the two distinct type hierarchies, making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely way to hide the implementation from the user, and thus keep it extensible in the future, is to model the API after Python's stable API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    }

    typedef struct{
        PyTypeObject *typeobj;  /* type of python scalar */
        ...;
        PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
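The spec-and-slots pattern can be modelled in Python to show why it stays extensible: the user only passes a table of named functions and never touches the class layout. Everything in this sketch (``DTypeBase``, ``ALLOWED_SLOTS``, ``init_dtype_from_spec`` and the slot names) is hypothetical and only mirrors the shape of the C snippet above.

```python
from fractions import Fraction


class DTypeBase:
    """Hypothetical stand-in for np.dtype."""


# Slot names are illustrative; new ones can be added later without
# breaking existing specs, because users never see the class layout.
ALLOWED_SLOTS = {"nonzero", "compare"}


def init_dtype_from_spec(name, scalar_type, slots):
    unknown = set(slots) - ALLOWED_SLOTS
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    namespace = {"type": scalar_type}
    for slot_name, func in slots.items():
        # Slots live on the class, not on each instance.
        namespace[slot_name] = staticmethod(func)
    return type(name, (DTypeBase,), namespace)


FractionDType = init_dtype_from_spec(
    "FractionDType",
    Fraction,
    {
        "nonzero": lambda value: value != 0,
        "compare": lambda a, b: (a > b) - (a < b),
    },
)

print(FractionDType.nonzero(Fraction(1, 2)))
print(FractionDType.compare(Fraction(1, 3), Fraction(1, 2)))
```

The design choice illustrated here is that the table of slots, rather than a public struct, is the contract between NumPy and the datatype author.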
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
- The current UFunc type resolution must be adapted to allow better
  control for user-defined dtypes as well as resolve current inconsistencies.
- The inner-loop used in UFuncs must be expanded to include a return
  value. Further, error reporting must be improved, and passing in
  dtype-specific information enabled. This requires the modification of the
  inner-loop function signature and addition of new hooks called before and
  after the inner-loop is used.
An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
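The two ideas above, an inner loop that reports errors through a return value and a units datatype that reuses an existing loop, can be sketched in pure Python. This is only an illustration under simplifying assumptions: lists stand in for array buffers, units are plain strings, and the function names are hypothetical.

```python
from typing import List, Tuple


def float_divide_loop(a: List[float], b: List[float], out: List[float]) -> int:
    """Plain float inner loop: fills ``out`` and returns 0, or -1 on error."""
    for i, (x, y) in enumerate(zip(a, b)):
        if y == 0:
            return -1          # report the error to the caller
        out[i] = x / y
    return 0


def unit_divide(a: List[float], unit_a: str,
                b: List[float], unit_b: str) -> Tuple[List[float], str]:
    """Handle the unit bookkeeping, then reuse the plain float loop."""
    out_unit = f"{unit_a}/{unit_b}"        # decided before the loop runs
    out = [0.0] * len(a)
    if float_divide_loop(a, b, out) != 0:
        raise ZeroDivisionError("division by zero in inner loop")
    return out, out_unit
```

Here the unit-aware wrapper contains no arithmetic of its own; it decides the output unit and delegates the numerical work to the existing loop.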
Discussion
See NEP 40 for a list of previous meetings and discussions.
References
.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
This document has been placed in the public domain.
Acknowledgments
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Mar 17, 2020, at 1:02 PM, Sebastian Berg sebastian@sipsolutions.net wrote:
in the spirit of trying to keep this moving, can I assume that the main reason for little discussion is that the actual changes proposed are not very far reaching as of now? Or is the reason that this is a fairly complex topic that you need more time to think about it? If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.
One reason for not responding is that it seems a lot of discussion of this has already taken place and this NEP is presented more as a conclusion summary rather than a discussion point.
I implement scientific imaging software and overall this NEP looks useful.
My only caveat is that I don’t think tracking physical units should be a primary use case. Units are fundamentally different than data types, even though there are libraries out there that treat them more like data types.
For instance, it makes sense to have the same physical unit but with different storage types. For instance, data with nanometer physical units can be stored as a float32 or as an int16 and be equally useful.
In addition, a unit is something that is mutated by the operation. For instance, reducing a 2D image with physical units by a factor of two in each dimension produces a different unit scaling (1km/pixel goes to 2km/pixel); whereas cropping the center half does not (1km/pixel stays as 1km/pixel).
Finally, units may be different for each axis in multidimensional data. For instance, we want a float32 array with two dimensions with the units on one dimension being time and the other dimension being spatial. (3 seconds x 50 nm).
I’m not sure these comments take away from this NEP — but maybe there is another approach for units: metadata about the shape of the data rather than a new datatype for physical units. We do this in our software already - but it would be helpful if NumPy had a built-in mechanism for that.
On Tue, Mar 17, 2020 at 4:35 PM Chris Meyer cmeyer1969@gmail.com wrote:
My only caveat is that I don’t think tracking physical units should be a primary use case. Units are fundamentally different than data types, even though there are libraries out there that treat them more like data types.
I strongly disagree. Right now, you need to implement a custom container to handle units, which makes it exceedingly difficult to then properly interact with other array_like objects, like dask, pandas, and xarray; handling units is completely orthogonal to handling slicing operations, data access, etc. so having to implement a container is overkill. Unit information describes information about the type of each of the elements within an array, including describing how operations between individual elements work. This sounds exactly like a dtype to me.
For instance, it makes sense to have the same physical unit but with different storage types. For instance, data with nanometer physical units can be stored as a float32 or as an int16 and be equally useful.
Yes, you would have the unit tracking as a mixin that would allow different storage types, absolutely.
In addition, a unit is something that is mutated by the operation. For instance, reducing a 2D image with physical units by a factor of two in each dimension produces a different unit scaling (1km/pixel goes to 2km/pixel); whereas cropping the center half does not (1km/pixel stays as 1km/pixel).
I'm not sure what your point is. Dtypes can change for some operations (np.sqrt(np.arange(5)) produces a float) while staying the same for others (e.g. addition)
Finally, units may be different for each axis in multidimensional data. For instance, we want a float32 array with two dimensions with the units on one dimension being time and the other dimension being spatial. (3 seconds x 50 nm).
The units for an array describe the elements *within* the array, they would have nothing to do with the dimensions. So for an array of image data, e.g. brightness temperatures, you would have physical units (e.g. Kelvin). You would have separate arrays of coordinates describing the spatial extent of the data along the relevant dimensions--each of these arrays of coordinates would have their own physical quantity information.
Ryan
My only caveat is that I don’t think tracking physical units should be a primary use case. Units are fundamentally different than data types, even though there are libraries out there that treat them more like data types.
I strongly disagree. Right now, you need to implement a custom container to handle units, which makes it exceedingly difficult to then properly interact with other array_like objects, like dask, pandas, and xarray; handling units is completely orthogonal to handling slicing operations, data access, etc. so having to implement a container is overkill. Unit information describes information about the type of each of the elements within an array, including describing how operations between individual elements work. This sounds exactly like a dtype to me.
Yes I see your point.
Finally, units may be different for each axis in multidimensional data. For instance, we want a float32 array with two dimensions with the units on one dimension being time and the other dimension being spatial. (3 seconds x 50 nm).
The units for an array describe the elements *within* the array, they would have nothing to do with the dimensions. So for an array of image data, e.g. brightness temperatures, you would have physical units (e.g. Kelvin). You would have separate arrays of coordinates describing the spatial extent of the data along the relevant dimensions--each of these arrays of coordinates would have their own physical quantity information.
Again, you are correct. I’m asking for similar metadata attached to the data shape rather than the data type.
On Mar 17, 2020, at 4:50 PM, Chris Meyer cmeyer1969@gmail.com wrote:
The units for an array describe the elements *within* the array, they would have nothing to do with the dimensions. So for an array of image data, e.g. brightness temperatures, you would have physical units (e.g. Kelvin). You would have separate arrays of coordinates describing the spatial extent of the data along the relevant dimensions--each of these arrays of coordinates would have their own physical quantity information.
Again, you are correct. I’m asking for similar metadata attached to the data shape rather than the data type.
This would be better worded as "similar metadata attached to the data shape _in addition to_ the data type.”.
On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi all,
in the spirit of trying to keep this moving, can I assume that the main reason for little discussion is that the actual changes proposed are not very far reaching as of now? Or is the reason that this is a fairly complex topic that you need more time to think about it?
Probably (a) it's a long NEP on a complex topic, (b) the past week has been a very weird week for everyone (in the extra-news-reading-time I could easily have re-reviewed the NEP), and (c) the amount of feedback one expects to get on a NEP is roughly inversely proportional to the scope and complexity of the NEP contents.
Today I re-read the parts I commented on before. This version is a big improvement over the previous ones. Thanks in particular for adding clear examples and the diagram, it helps a lot.
If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.
If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one week notice in the next days.
I agree. I think I would like to keep the option open though to come back to the NEP later to improve the clarity of the text about motivation/plan/examples/scope, given that this will be the reference for a major amount of work for a long time to come.
To summarize one more time, the main point is that:
This point seems fine, and I'm +1 for going ahead with the described parts of the technical design.
Cheers, Ralf
type(np.dtype(np.float64))
will be `np.dtype[float64]`, a subclass of dtype, so that:
issubclass(np.dtype[float64], np.dtype)
is true. This means that we will have one class for every current type number: `dtype.num`. The implementation of these subclasses will be a C-written (extension) MetaClass, all details of this class are supposed to remain experimental in flux at this time.
Cheers
Sebastian
On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
Hi all,
I am pleased to propose NEP 41: First step towards a new Datatype System https://numpy.org/neps/nep-0041-improved-dtype-support.html
This NEP motivates the larger restructure of the datatype machinery in NumPy and defines a few fundamental design aspects. The long term user impact will be allowing easier and more rich featured user defined datatypes.
As this is a large restructure, the NEP represents only the first steps with some additional information in further NEPs being drafted [1] (this may be helpful to look at depending on the level of detail you are interested in). The NEP itself does not propose to add significant new public API. Instead it proposes to move forward with an incremental internal refactor and lays the foundation for this process.
The main user facing change at this time is that datatypes will become classes (e.g. ``type(np.dtype("float64"))`` will be a float64 specific class). For most users, the main impact should be many new datatypes in the long run (see the user impact section). However, for those interested in API design within NumPy or with respect to implementing new datatypes, this and the following NEPs are important decisions in the future roadmap for NumPy.
The current full text is reproduced below, although the above link is probably a better way to read it.
Cheers
Sebastian
[1] NEP 40 gives some background information about the current systems and issues with it:
https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac...
and NEP 42 being a first draft of how the new API may look like:
https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3...
(links to current rendered versions, check https://github.com/numpy/numpy/pull/15505 and https://github.com/numpy/numpy/pull/15507 for updates)
=================================================
NEP 41 — First step towards a new Datatype System
=================================================
:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03
.. note::

    This NEP is part of a series of NEPs encompassing first information
    about the previous dtype implementation and issues with it in NEP 40.
    NEP 41 (this document) then provides an overview and generic design
    choices for the refactor. Further NEPs 42 and 43 go into the technical
    details of the datatype and universal function related internal and
    external API changes. In some cases it may be necessary to consult the
    other NEPs for a full picture of the desired changes and why these
    changes are necessary.
Abstract
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
.. seealso::

    The user impact section includes examples of what kind of new datatypes
    will be enabled by the proposed changes in the long run. It may thus
    help to read these sections out of order.
Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <_ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
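To make the string example concrete, here is a minimal pure-Python sketch of a parametric datatype whose output type for addition must be computed from its inputs. ``StringDType``, ``resolve_add`` and ``can_cast_safely_to`` are hypothetical names used only for illustration; they are not part of NumPy or of this NEP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDType:
    """Hypothetical parametric string datatype; ``length`` is the parameter."""
    length: int

    def resolve_add(self, other: "StringDType") -> "StringDType":
        # Concatenation: the output length is the sum of the input lengths.
        return StringDType(self.length + other.length)

    def can_cast_safely_to(self, other: "StringDType") -> bool:
        # Casting is safe only if nothing would be truncated.
        return other.length >= self.length


s4 = StringDType(4)
result = s4.resolve_add(s4)
```

The point of the sketch is that neither the result type nor the casting rule can be looked up in a fixed table; both depend on the parameter carried by each instance.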
Currently there are many factors which limit the creation of new user-defined datatypes:
- Creating casting rules for parametric user-defined dtypes is either
  impossible or so complex that it has never been attempted.
- Type promotion, e.g. the operation deciding that adding float and
  integer values should return a float value, is very valuable for numeric
  datatypes but is limited in scope for user-defined and especially
  parametric datatypes.
- Much of the logic (e.g. promotion) is written in single functions instead
  of being split as methods on the datatype itself.
- In the current design datatypes cannot have methods that do not
  generalize to other datatypes. For example a unit datatype cannot have a
  ``.to_si()`` method to easily find the datatype which would represent the
  same values in SI units.
The large need to solve these issues has driven the scientific community to create work-arounds in multiple projects implementing physical units as an array-like class instead of a datatype, which would generalize better across multiple array-likes (Dask, pandas, etc.). Already, Pandas has made a push into the same direction with its extension arrays [pandas_extension_arrays]_ and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
- Phase I: Restructure and extend the datatype infrastructure (This NEP 41)
- Phase II: Incrementally define or rework API (Detailed largely in
  NEPs 42/43)
- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see backward compatibility section). However, a transition requiring large code adaption is not anticipated and not within scope.
Specifically, this NEP makes the following design choices, which are discussed in more detail in the detailed description section:
- Each datatype will be an instance of a subclass of ``np.dtype``, with most
  of the datatype-specific logic being implemented as special methods on the
  class. In the C-API, these correspond to specific slots. In short, for
  ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but
  ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype``
  itself. The ``PyArray_ArrFuncs`` which are currently stored as a pointer on
  the instance (as ``PyArray_Descr->f``), should instead be stored on the
  class as typically done in Python. In the future these may correspond to
  python side dunder methods. Storage information such as itemsize and
  byteorder can differ between different dtype instances (e.g. "S3" vs. "S8")
  and will remain part of the instance. This means that in the long run the
  current low-level access to dtype methods will be removed (see
  ``PyArray_ArrFuncs`` in NEP 40).
- The current NumPy scalars will *not* change, they will not be instances of
  datatypes. This will also be true for new datatypes, scalars will not be
  instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to
  return ``True`` when appropriate).

  Detailed technical decisions to follow in NEP 42.
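The class/instance relationship described in the list above can be modelled in pure Python. This is only a sketch: ``DTypeMeta``, ``DType``, ``Float64`` and ``Bytes`` are stand-ins written for this example and are not the NumPy implementation.

```python
class DTypeMeta(type):
    """Hypothetical metaclass; its instances are the DType classes."""


class DType(metaclass=DTypeMeta):
    """Stand-in for ``np.dtype``; instances carry storage information."""

    def __init__(self, itemsize: int, byteorder: str = "="):
        self.itemsize = itemsize
        self.byteorder = byteorder


class Float64(DType):
    def __init__(self, byteorder: str = "="):
        super().__init__(itemsize=8, byteorder=byteorder)

    # Type-specific logic lives on the class, not on each instance.
    @classmethod
    def default_value(cls) -> float:
        return 0.0


class Bytes(DType):
    """Parametric: "S3" and "S8" are different instances of one class."""


f = Float64()
s3, s8 = Bytes(3), Bytes(8)
```

In this model ``isinstance(f, DType)`` holds while ``type(f)`` is the more specific ``Float64``, and the two ``Bytes`` instances share one class but differ in ``itemsize``.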
Further, the public API will be designed in a way that is extensible in the future:
- All new C-API functions provided to the user will hide implementation
  details as much as possible. The public API should be an identical, but
  limited, version of the C-API used for the internal NumPy datatypes.
The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:
- To enable all of the desired functionality for new user-defined
  datatypes, the UFunc machinery will be changed to replace the current
  dispatching and type resolution system. The old system should be *mostly*
  supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
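One way to read this principle is that promotion only ever consults the two participating datatypes, never a global registry that a third datatype could modify. The following pure-Python sketch assumes a hypothetical ``__common_dtype__`` class method; the name and the protocol are illustrative and not decided by this NEP.

```python
class DType:
    @classmethod
    def __common_dtype__(cls, other):
        # Default: this class knows nothing about ``other``.
        return NotImplemented


class Float64(DType):
    pass


class Int64(DType):
    @classmethod
    def __common_dtype__(cls, other):
        return Float64 if other is Float64 else NotImplemented


def common_dtype(a, b):
    """Ask ``a``, then ``b``; no third datatype is ever consulted."""
    if a is b:
        return a
    for first, second in ((a, b), (b, a)):
        result = first.__common_dtype__(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common dtype for {a.__name__} and {b.__name__}")


class Rational(DType):   # a newly added user-defined dtype
    pass
```

Defining ``Rational`` leaves ``common_dtype(Int64, Float64)`` unchanged, and combining ``Int64`` with ``Rational`` fails because neither side knows about the other.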
User Impact
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
- bfloat16, used in deep learning
- categorical types
- physical units (such as meters)
- datatypes for tracing/automatic differentiation
- high, fixed precision math
- specialized integer types such as int2, int24
- new, better datetime representations
- extending e.g. integer dtypes to have a sentinel NA value
- geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem by having such new datatypes, as well as consolidating implementation of such datatypes within NumPy to achieve better interoperability.
Examples
^^^^^^^^

The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP, and choices (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.

Simple Numerical Types
""""""""""""""""""""""

Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
Fixed, high precision math
""""""""""""""""""""""""""

Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> import mpmath as mp
    >>> print(mp.dps)  # the current (default) precision
    15
NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True

Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.
Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)

And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
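The casting and promotion rules used in these examples reduce to comparisons of the precision parameter. The sketch below models them in pure Python without ``mpmath`` or NumPy; ``MpfDType``, ``can_cast_safely`` and ``common_type`` are hypothetical names, and the value 16 for float64 is the same assumption made in the text above.

```python
from dataclasses import dataclass

FLOAT64_DPS = 16  # assumed decimal precision of float64, as in the text above


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical datatype parametrized by decimal precision ``dps``."""
    dps: int


def can_cast_safely(from_dps: int, to: MpfDType) -> bool:
    # A cast is "safe" only if the target keeps at least as many digits.
    return to.dps >= from_dps


def common_type(a: MpfDType, b: MpfDType) -> MpfDType:
    # The higher precision wins, so neither input loses digits.
    return MpfDType(max(a.dps, b.dps))
```

Under these rules a cast from 15 to 100 digits is safe, the reverse is not, and float64 cannot be safely cast to an ``MpfDType`` with ``dps=4``.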
Categoricals
""""""""""""

Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)

could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])

The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.
Whether or not casting is defined from one categorical with fewer values to one with strictly more values is something that the Categorical datatype would need to decide. Both options should be available.
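The dictionary-like behaviour described above can be sketched in pure Python: category names are checked for uniqueness when the datatype is created, values are stored as integer codes, and equality compares only the codes. The ``Categorical`` class below is a hypothetical illustration, not a proposed implementation.

```python
class Categorical:
    """Hypothetical fixed-category datatype storing small integer codes."""

    def __init__(self, categories):
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = list(categories)
        self._codes = {name: i for i, name in enumerate(categories)}

    def encode(self, values):
        # Raises KeyError for a value that is not a known category.
        return [self._codes[v] for v in values]

    def decode(self, codes):
        return [self.categories[c] for c in codes]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])
# Equality only needs to compare the integer codes.
equal = [a == b for a, b in zip(breakfast, breakfast2)]
```

If the category list is given in sorted order, comparing or sorting the codes also orders the values, which is the efficiency argument made in the text.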
Unit on the Datatype
""""""""""""""""""""

There are different ways to define Units, depending on how the internal machinery would be organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``; the unit itself is part of the DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.s)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())
The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)
There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)
Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)

In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
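A small pure-Python model shows why the output unit depends on the operation. Units are represented here as a mapping from base unit to exponent; ``Unit``, ``resolve_multiply`` and ``result_type`` are hypothetical names for this sketch and do not correspond to NumPy or ``astropy`` functions.

```python
from collections import Counter


class Unit:
    """Hypothetical unit datatype: a mapping of base unit -> exponent."""

    def __init__(self, **exponents: int):
        # Drop zero exponents so that m * m**-1 compares equal to "no unit".
        self.exponents = {k: v for k, v in exponents.items() if v != 0}

    def __eq__(self, other):
        return isinstance(other, Unit) and self.exponents == other.exponents

    def __repr__(self):
        return f"Unit({self.exponents})"


def resolve_multiply(a: Unit, b: Unit) -> Unit:
    total = Counter(a.exponents)
    total.update(b.exponents)      # exponents add under multiplication
    return Unit(**total)


def result_type(a: Unit, b: Unit) -> Unit:
    # Without knowing the operation, only identical units can be combined.
    if a != b:
        raise TypeError(f"no common unit for {a} and {b}")
    return a
```

Multiplying meters by seconds yields a well-defined new unit, whereas asking for an operation-independent common type of the two raises an error.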
Implementation
Plan to Approach the Full Refactor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To address these issues in NumPy and enable new datatypes, multiple development stages are required:
- Phase I: Restructure and extend the datatype infrastructure (This NEP)

  - Organize Datatypes like normal Python classes [`PR 15508`]_

- Phase II: Incrementally define or rework API

  - Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)

  - Incrementally define all necessary functionality through the new
    API (NEP 42):

    - Defining operations such as ``np.common_type``.
    - Allowing to define casting between datatypes.
    - Add functionality necessary to create a numpy array from Python
      scalars (i.e. ``np.array(...)``).
    - …

  - Restructure how universal functions work (NEP 43), in order to:

    - make it possible to allow a `~numpy.ufunc` such as ``np.add``
      to be extended by user-defined datatypes such as Units.
    - allow efficient lookup for the correct implementation for
      user-defined datatypes.
    - enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine output type.

- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities:

  - Cleanup of legacy behaviour where it is considered buggy or
    undesirable.
  - Provide a path to define new datatypes from Python.
  - Assist the community in creating types such as Units or Categoricals
  - Allow strings to be used in functions such as ``np.equal`` or
    ``np.add``.
  - Remove legacy code paths within NumPy to improve long term
    maintainability
This document serves as a basis for phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" datatype Python type object, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining an, initially preliminary, stable public API.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only regards Phase I (shaded area), the rest encompasses Phase II and the design choices are up for discussion, however, it highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The required changes necessary to NumPy are large and touch many areas of the code base but many of these changes can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
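The relationship between the metaclass, the DType classes, and their instances can be sketched in pure Python. This is only an illustration under stated assumptions: the real ``PyArray_DTypeMeta`` is defined in C, and the names ``DTypeMeta``, ``dtype``, ``Float64DType``, ``StringDType`` and the ``kind`` attribute below are hypothetical stand-ins, not NumPy API.

```python
class DTypeMeta(type):
    """Hypothetical Python stand-in for the C-defined ``PyArray_DTypeMeta``."""

    def __repr__(cls):
        return f"<DType class {cls.__name__!r}>"


class dtype(metaclass=DTypeMeta):
    """Stand-in for ``np.dtype``; instances hold storage information."""

    kind = None  # class-level "slot", shared by every instance of a DType

    def __init__(self, itemsize, byteorder="="):
        # Per-instance storage information, as described in the NEP.
        self.itemsize = itemsize
        self.byteorder = byteorder


class Float64DType(dtype):
    kind = "f"

    def __init__(self, byteorder="="):
        super().__init__(itemsize=8, byteorder=byteorder)


class StringDType(dtype):
    kind = "S"


f8 = Float64DType()
s3, s8 = StringDType(3), StringDType(8)

# DType classes are instances of the metaclass and subclasses of ``dtype``.
print(type(Float64DType) is DTypeMeta, issubclass(Float64DType, dtype))
# Instances of one DType class can differ in their storage information.
print(type(s3) is type(s8), s3.itemsize, s8.itemsize)
```

The point of the sketch is the split of responsibilities: behaviour shared by all instances lives on the class, while itemsize and byteorder stay on the instance.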
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before settling the full internal API:
1. New machinery for array coercion, with the goal of enabling user
   DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
   into DType method slots.
At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:
**Python API**:
- ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
while right now ``type(np.dtype("f8")) is np.dtype``. Code should use ``isinstance`` checks, and in very rare cases may have to be adapted to use it.
**C-API**:
- In old versions of NumPy ``PyArray_DescrCheck`` is a macro
which uses ``type(dtype) is np.dtype``. When compiling against an old NumPy version, the macro may have to be replaced with the corresponding ``PyObject_IsInstance`` call. (If this is a problem, we could backport fixing the macro)
- The UFunc machinery changes will break *limited* parts of the
current implementation. Replacing e.g. the default ``TypeResolver`` is expected to remain supported for a time, although optimized masked inner loop iteration (which is not even used *within* NumPy) will no longer be supported.
- All functions currently defined on the dtypes, such as ``PyArray_Descr->f->nonzero``, will be defined and accessed
  differently. This means that in the long run low-level access code will have to be changed to use the new API. Such changes are expected to be necessary in very few projects.
**dtype implementors (C-API)**:
- The array which is currently provided to some functions (such as
  cast functions) will no longer be provided. For example
  ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn``
  may instead receive a dummy array object with only some fields
  (mainly the dtype) being valid. At least in some code paths, a
  similar mechanism is already used.
- The ``scalarkind`` slot and registration of scalar casting will
be removed/ignored without replacement. It currently allows partial value-based casting. The ``PyArray_ScalarKind`` function will continue to work for builtin types, but will not be used internally and be deprecated.
- Currently user dtypes are defined as instances of ``np.dtype``. The creation works by the user providing a prototype instance. NumPy will need to modify at least the type during registration. This has no effect for either ``rational`` or ``quaternion`` and
mutation of the structure seems unlikely after registration.
Since there is a fairly large API surface concerning datatypes, further changes or the limitation of certain functions to currently existing datatypes are likely to occur. For example, functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current NumPy datatypes are not full scale python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime`` can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).
The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value; this may be achieved by allowing the creation of *abstract* datatypes. An example of an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
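As a rough illustration of how abstract datatypes could play the role of Python's abstract base classes, the following sketch uses the standard ``abc`` module. All class names (``DType``, ``Floating``, ``Float32``, ``Float64``, ``Int64``) and the helper ``is_floating`` are hypothetical and only model the idea; they are not part of NumPy.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Hypothetical root of the DType class hierarchy."""

    @property
    @abstractmethod
    def itemsize(self):
        """Size of one element in bytes; concrete DTypes must define it."""


class Floating(DType):
    """Abstract DType, the datatype analogue of ``np.floating``."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


class Int64(DType):
    itemsize = 8


def is_floating(dt):
    """Check membership in the abstract category with ``isinstance``."""
    return isinstance(dt, Floating)


print(is_floating(Float64()), is_floating(Int64()))

try:
    Floating()  # abstract: it groups DTypes but cannot describe array data
except TypeError as exc:
    print("cannot instantiate:", exc)
```

In this model the abstract class only categorizes; concrete DTypes do not inherit behaviour from each other, which matches the caution about inheritance expressed above.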
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that rather than reducing the complexity, i.e. by merging the two distinct type hierarchies, making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely solution to hide the implementation from the user, and thus make it extensible in the future, is to model the API after Python's stable API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct {
        PyTypeObject *typeobj;  /* type of python scalar */
        ...;
        PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
- The current UFunc type resolution must be adapted to allow better
control for user-defined dtypes as well as resolve current inconsistencies.
- The inner-loop used in UFuncs must be expanded to include a return
  value. Further, error reporting must be improved, and passing in
  dtype-specific information enabled. This requires the modification of
  the inner-loop function signature and addition of new hooks called
  before and after the inner-loop is used.
An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
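The idea of falling back to an existing loop can be sketched as follows. This is a simplified pure-Python model, not the proposed C-API: ``UnitDType``, ``float_add_loop`` and ``unit_add`` are hypothetical names, and the requirement that both operands share a unit is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitDType:
    """Hypothetical float64-with-unit datatype; the unit lives on the instance."""
    unit: str


def float_add_loop(a, b):
    """Stand-in for an existing float64 inner loop that knows nothing of units."""
    return [x + y for x, y in zip(a, b)]


def unit_add(values1, dtype1, values2, dtype2):
    # Unit-specific step: determine the output datatype before any math runs.
    if dtype1.unit != dtype2.unit:
        raise TypeError(f"cannot add {dtype1.unit!r} and {dtype2.unit!r}")
    out_dtype = dtype1
    # Reuse the existing numerical loop unchanged for the actual computation.
    return float_add_loop(values1, values2), out_dtype


values, out_dtype = unit_add([1.0, 2.0], UnitDType("m"), [3.0, 4.0], UnitDType("m"))
print(values, out_dtype)
```

The datatype-specific logic is confined to the step before the loop, which is why a hook that runs ahead of the inner loop is needed.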
Discussion
See NEP 40 for a list of previous meetings and discussions.
References
.. [pandas_extension_arrays]
https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
This document has been placed in the public domain.
Acknowledgments
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Thanks for publicizing this and all the work that has gone into getting this far.
I'm extremely supportive of the foundational DType meta-type and making dtypes classes. This was the epiphany I had in 2015 that led me to experiment with xnd and later mtypes. I have not had the funding to work on it much since that time directly.
But, this is the right way to connect the data type system with the rest of Python typing. NumPy's dtypes are currently analogous to Python 1's user-defined classes. In Python 1 *all* user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.
Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach. Doing this first while trying to minimize other changes will help a lot. I'm very excited by the work being done in this direction.
I can appreciate the desire to be cautious on some of the other issues (like removing numpy array scalars). I do still think that eventually removing numpy array scalars in favor of instances of dtype objects will be the less complex approach, and I am not sold generally by the reasons listed in the NEP. I can appreciate, though, that it's not something to do as part of *this* NEP, as getting there might take more effort than desired at this point.
What I would *strongly* recommend right now, however, is to make the new NumPy dtype system a separately-installable module (kept in the NumPy GitHub organization). In that way, people can depend on the NumPy type system without depending on NumPy itself. I think this will become more and more important in the future. It will help the design as you see NumPy as one of many *consumers* of the type system instead of the only one. It would also help projects like arrow and xnd and others in the future that might only want to depend on NumPy's type system but otherwise implement their own computations.
This might require a little more work to provide an adaptor layer in NumPy itself to use the new system instead of its current dtypes, but I think it will also help ensure that the datatype API is cleaner and more useful to the Python ecosystem as a whole.
Thanks,
-Travis
On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi all,
I am pleased to propose NEP 41: First step towards a new Datatype System https://numpy.org/neps/nep-0041-improved-dtype-support.html
This NEP motivates the larger restructure of the datatype machinery in NumPy and defines a few fundamental design aspects. The long term user impact will be allowing easier and more rich featured user defined datatypes.
As this is a large restructure, the NEP represents only the first steps with some additional information in further NEPs being drafted [1] (this may be helpful to look at depending on the level of detail you are interested in). The NEP itself does not propose to add significant new public API. Instead it proposes to move forward with an incremental internal refactor and lays the foundation for this process.
The main user facing change at this time is that datatypes will become classes (e.g. ``type(np.dtype("float64"))`` will be a float64 specific class). For most users, the main impact should be many new datatypes in the long run (see the user impact section). However, for those interested in API design within NumPy or with respect to implementing new datatypes, this and the following NEPs are important decisions in the future roadmap for NumPy.
The current full text is reproduced below, although the above link is probably a better way to read it.
Cheers
Sebastian
[1] NEP 40 gives some background information about the current systems and issues with it:
https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac... and NEP 42 being a first draft of how the new API may look like:
https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3... (links to current rendered versions, check https://github.com/numpy/numpy/pull/15505 and https://github.com/numpy/numpy/pull/15507 for updates)
=================================================
NEP 41 — First step towards a new Datatype System
=================================================

:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03
.. note::
This NEP is part of a series of NEPs encompassing first information about the previous dtype implementation and issues with it in NEP 40. NEP 41 (this document) then provides an overview and generic design choices for the refactor. Further NEPs 42 and 43 go into the technical details of the datatype and universal function related internal and external API changes. In some cases it may be necessary to consult the other NEPs for a full picture of the desired changes and why these changes are necessary.
Abstract
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
.. seealso::
The user impact section includes examples of what kind of new datatypes will be enabled by the proposed changes in the long run. It may thus help to read these sections out of order.
Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well-encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
Currently there are many factors which limit the creation of new user-defined datatypes:
- Creating casting rules for parametric user-defined dtypes is either
impossible or so complex that it has never been attempted.
- Type promotion, e.g. the operation deciding that adding float and integer values should return a float value, is very valuable for numeric
datatypes but is limited in scope for user-defined and especially parametric datatypes.
- Much of the logic (e.g. promotion) is written in single functions instead of being split as methods on the datatype itself.
- In the current design datatypes cannot have methods that do not
generalize to other datatypes. For example a unit datatype cannot have a ``.to_si()`` method to easily find the datatype which would represent the same values in SI units.
The large need to solve these issues has driven the scientific community to create work-arounds in multiple projects implementing physical units as an array-like class instead of a datatype, which would generalize better across multiple array-likes (Dask, pandas, etc.). Already, Pandas has made a push into the same direction with its extension arrays [pandas_extension_arrays]_ and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
- Phase I: Restructure and extend the datatype infrastructure (This NEP 41)
- Phase II: Incrementally define or rework API (Detailed largely in NEPs
42/43)
- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see backward compatibility section). However, a transition requiring large code adaption is not anticipated and not within scope.
Specifically, this NEP makes the following design choices, which are discussed in more detail in the detailed description section:
- Each datatype will be an instance of a subclass of ``np.dtype``, with
most of the datatype-specific logic being implemented as special methods on the class. In the C-API, these correspond to specific slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself. The ``PyArray_ArrFuncs`` which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``), should instead be stored on the class as typically done in Python. In the future these may correspond to python side dunder methods. Storage information such as itemsize and byteorder can differ between different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance. This means that in the long run the current lowlevel access to dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
- The current NumPy scalars will *not* change, they will not be instances
of datatypes. This will also be true for new datatypes, scalars will not be instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to return ``True`` when appropriate).
Detailed technical decisions to follow in NEP 42.
Further, the public API will be designed in a way that is extensible in the future:
- All new C-API functions provided to the user will hide implementation
details as much as possible. The public API should be an identical, but limited, version of the C-API used for the internal NumPy datatypes.
The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:
- To enable all of the desired functionality for new user-defined
datatypes, the UFunc machinery will be changed to replace the current dispatching and type resolution system. The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
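One way to read this principle is that promotion only ever consults the two participating datatypes. The sketch below models that with a hypothetical ``__common_dtype__`` class method and a hypothetical ``common_dtype`` function; the DType classes are plain Python stand-ins, and the promotion rules chosen for them are assumptions for illustration only.

```python
class DType:
    """Hypothetical base class: by default a DType only promotes with itself."""

    @classmethod
    def __common_dtype__(cls, other):
        return cls if other is cls else NotImplemented


class Int64(DType):
    pass


class Float64(DType):
    @classmethod
    def __common_dtype__(cls, other):
        # Float64 knows about Int64, but not about user-defined DTypes.
        return Float64 if other in (Float64, Int64) else NotImplemented


class Rational(DType):
    """Stand-in for a user-defined DType that knows about Int64 only."""

    @classmethod
    def __common_dtype__(cls, other):
        return Rational if other in (Rational, Int64) else NotImplemented


def common_dtype(a, b):
    # Only ``a`` and ``b`` are asked; no third DType can inject itself.
    for first, second in ((a, b), (b, a)):
        result = first.__common_dtype__(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common DType for {a.__name__} and {b.__name__}")


print(common_dtype(Int64, Float64).__name__)
print(common_dtype(Int64, Rational).__name__)
```

Defining ``Rational`` does not alter the result for ``Int64`` and ``Float64``, and ``common_dtype(Float64, Rational)`` raises because neither side knows a suitable third type.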
User Impact
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
- bfloat16, used in deep learning
- categorical types
- physical units (such as meters)
- datatypes for tracing/automatic differentiation
- high, fixed precision math
- specialized integer types such as int2, int24
- new, better datetime representations
- extending e.g. integer dtypes to have a sentinel NA value
- geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem by having such new datatypes, as well as consolidating implementation of such datatypes within NumPy to achieve better interoperability.
Examples
^^^^^^^^

The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP and choices (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.

Simple Numerical Types
""""""""""""""""""""""
Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
Fixed, high precision math
""""""""""""""""""""""""""
Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> import mpmath as mp
    >>> print(mp.mp.dps)  # the current (default) precision
    15
NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as an ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True

Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.

Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)

And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
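The casting and promotion rules described for the ``mpf`` example can be modelled in a few lines of plain Python. This is a minimal sketch, assuming a hypothetical parametric ``MpfDType`` whose only parameter is ``dps``; the functions ``can_cast`` and ``common_type`` here are local stand-ins and not the NumPy functions of the same name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical parametric datatype: the precision is stored per instance."""
    dps: int


def can_cast(from_dtype, to_dtype, casting="safe"):
    if casting == "safe":
        # Safe only if no decimal digits of precision are lost.
        return to_dtype.dps >= from_dtype.dps
    if casting == "same_kind":
        # Any mpf-to-mpf cast stays within the same kind.
        return True
    raise ValueError(f"unsupported casting mode: {casting!r}")


def common_type(a, b):
    # Promotion picks the higher of the two precisions.
    return MpfDType(max(a.dps, b.dps))


low, high = MpfDType(dps=15), MpfDType(dps=100)
print(can_cast(low, high), can_cast(high, low))
print(can_cast(high, low, casting="same_kind"))
print(common_type(MpfDType(5), MpfDType(10)))
```

Because the precision is instance data, both rules have to inspect the instances rather than only the DType class, which is what makes such datatypes parametric.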
Categoricals
""""""""""""
Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)

could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])

The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.

Whether or not casting is defined from one categorical with fewer values to one with strictly more values defined is something that the Categorical datatype would need to decide. Both options should be available.
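The dictionary-like behaviour described above can be sketched with integer codes. This is one possible strategy among many, written in plain Python: the ``Categorical`` class, its ``encode`` method and the ``equal`` helper are hypothetical and only mirror the ``breakfast`` example.

```python
class Categorical:
    """Hypothetical fixed-category datatype storing labels as integer codes."""

    def __init__(self, categories):
        # Item names must be unique; this is checked once, on dtype creation.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        self._codes = {name: code for code, name in enumerate(self.categories)}

    def encode(self, values):
        """Convert labels to their compact integer representation."""
        return [self._codes[value] for value in values]


def equal(codes1, codes2):
    # With a shared Categorical, equality reduces to integer comparison.
    return [a == b for a, b in zip(codes1, codes2)]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])
print(breakfast)
print(equal(breakfast, breakfast2))
```

If the categories were given in a meaningful order, the same integer codes could also be used for sorting and comparison.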
Unit on the Datatype
""""""""""""""""""""

There are different ways to define Units, depending on how the internal machinery would be organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``; the unit itself is part of the DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
Note that units are a bit tricky. It is debatable, whether::
>>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.seconds)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())
The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)
There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)
Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
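The difference between generic promotion and operation-specific output resolution can be illustrated with a small model. The names ``UnitDType``, ``result_type``, ``resolve_multiply`` and ``resolve_divide`` are hypothetical, units are kept as plain strings for brevity, and no simplification of compound units is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitDType:
    """Hypothetical float64-with-unit datatype."""
    unit: str


def result_type(a, b):
    # Generic promotion has no operation context, so mixed units must error.
    if a.unit != b.unit:
        raise TypeError(f"no common datatype for {a.unit!r} and {b.unit!r}")
    return a


def resolve_multiply(a, b):
    # The multiply loop knows the operation and can derive the output unit.
    return UnitDType(f"{a.unit}*{b.unit}")


def resolve_divide(a, b):
    return UnitDType(f"{a.unit}/{b.unit}")


meters, seconds = UnitDType("m"), UnitDType("s")
print(resolve_multiply(meters, seconds))
print(resolve_divide(meters, seconds))

try:
    result_type(meters, seconds)
except TypeError as exc:
    print("result_type failed:", exc)
```

The resolution functions stand in for the decision a specific ufunc loop has to make before the calculation starts.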
Implementation
Plan to Approach the Full Refactor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To address these issues in NumPy and enable new datatypes, multiple development stages are required:
- Phase I: Restructure and extend the datatype infrastructure (This NEP)

  - Organize Datatypes like normal Python classes [`PR 15508`]_

- Phase II: Incrementally define or rework API

  - Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)

  - Incrementally define all necessary functionality through the new API
    (NEP 42):

    * Defining operations such as ``np.common_type``.
    * Allowing to define casting between datatypes.
    * Add functionality necessary to create a numpy array from Python
      scalars (i.e. ``np.array(...)``).
    * …

  - Restructure how universal functions work (NEP 43), in order to:

    * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be
      extended by user-defined datatypes such as Units.
    * allow efficient lookup for the correct implementation for
      user-defined datatypes.
    * enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine output type.

- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities:

  - Cleanup of legacy behaviour where it is considered buggy or
    undesirable.
  - Provide a path to define new datatypes from Python.
  - Assist the community in creating types such as Units or Categoricals
  - Allow strings to be used in functions such as ``np.equal`` or
    ``np.add``.
  - Remove legacy code paths within NumPy to improve long term
    maintainability

This document serves as a basis for phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" datatype Python type object, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining an, initially preliminary, stable public API.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only regards Phase I (shaded area); the rest encompasses Phase II, whose design choices are up for discussion. However, the figure highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The changes required to NumPy are large and touch many areas of the code base, but many of them can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
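The relationship between the metaclass, the DType classes and the dtype instances can be sketched in pure Python. This is only a toy model under stated assumptions: the real ``PyArray_DTypeMeta`` is to be defined in C, and the names ``DTypeMeta``, ``dtype``, ``Float64DType`` and ``scalar_type`` below are illustrative stand-ins rather than the actual implementation.

```python
class DTypeMeta(type):
    """Toy metaclass; its instances are the DType classes."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Class-level "slot": information shared by all instances of a DType.
        cls.scalar_type = namespace.get("scalar_type")


class dtype(metaclass=DTypeMeta):
    """Toy stand-in for ``np.dtype``, the base class of all DTypes."""

    def __init__(self, itemsize, byteorder="="):
        # Storage details stay on the instance, since they may differ
        # between instances of the same DType class.
        self.itemsize = itemsize
        self.byteorder = byteorder


class Float64DType(dtype):
    scalar_type = float

    def __init__(self, byteorder="="):
        super().__init__(itemsize=8, byteorder=byteorder)


f8 = Float64DType()
print(type(Float64DType) is DTypeMeta)  # the class is an instance of the metaclass
print(isinstance(f8, dtype))            # instances are still "dtypes"
print(type(f8) is dtype)                # but their type is now a subclass
```

In this model, per-class information lives on ``Float64DType`` itself, which is the role the custom C slots are meant to play.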
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before the full internal API is settled:
- New machinery for array coercion, with the goal of enabling user DTypes with appropriate class methods.
- The replacement or wrapping of the current casting machinery.
- Incremental redefinition of the current ``PyArray_ArrFuncs`` slots into DType method slots.
At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:
**Python API**:
- ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while
right now ``type(np.dtype("f8")) is np.dtype``. Code should use ``isinstance`` checks, and in very rare cases may have to be adapted to use it.
**C-API**:
- In old versions of NumPy ``PyArray_DescrCheck`` is a macro which uses ``type(dtype) is np.dtype``. When compiling against an old NumPy
version, the macro may have to be replaced with the corresponding ``PyObject_IsInstance`` call. (If this is a problem, we could backport fixing the macro)
- The UFunc machinery changes will break *limited* parts of the current implementation. Replacing e.g. the default ``TypeResolver`` is
expected to remain supported for a time, although optimized masked inner loop iteration (which is not even used *within* NumPy) will no longer be supported.
- All functions currently defined on the dtypes, such as ``PyArray_Descr->f->nonzero``, will be defined and accessed
  differently. This means that in the long run low-level access code will have to be changed to use the new API. Such changes are expected to be necessary in very few projects.
**dtype implementors (C-API)**:
- The array which is currently provided to some functions (such as cast
functions), will no longer be provided. For example ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn``, may instead receive a dummy array object with only some fields (mainly the dtype), being valid. At least in some code paths, a similar mechanism is already used.
- The ``scalarkind`` slot and registration of scalar casting will be removed/ignored without replacement. It currently allows partial value-based casting. The ``PyArray_ScalarKind`` function will continue to work for builtin
types, but will not be used internally and be deprecated.
- Currently user dtypes are defined as instances of ``np.dtype``. The creation works by the user providing a prototype instance. NumPy will need to modify at least the type during registration. This has no effect for either ``rational`` or ``quaternion`` and
mutation of the structure seems unlikely after registration.
Since there is a fairly large API surface concerning datatypes, further changes, or the limitation of certain functions to currently existing datatypes, are likely to occur. For example, functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current NumPy datatypes are not full-scale Python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime``, can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).
The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value, however, and this may be achieved by allowing the creation of *abstract* datatypes. An example of an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
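How abstract datatypes could replace checks on ``dtype.kind`` can be sketched with Python's own ``abc`` module. The class names below (``DType``, ``Floating``, ``Float32``, ``Float64``, ``Int64``) are hypothetical placeholders chosen for this example, not names defined by this NEP.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Toy root of the datatype hierarchy."""

    @property
    @abstractmethod
    def itemsize(self):
        """Size of one element in bytes; concrete DTypes must define it."""


class Floating(DType):
    """Abstract: any floating point datatype (cf. ``np.floating``)."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


class Int64(DType):
    itemsize = 8


def is_floating(dt):
    # Replaces string comparisons such as ``dt.kind == "f"``.
    return isinstance(dt, Floating)


print(is_floating(Float64()), is_floating(Int64()))

# Abstract datatypes cannot be instantiated, only their concrete subclasses.
try:
    Floating()
except TypeError as exc:
    print("error:", exc)
```

The abstract classes carry no storage information; they only group concrete DTypes so that ``isinstance`` and ``issubclass`` checks work as users would expect.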
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that, rather than reducing the complexity by merging the two distinct type hierarchies, making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely solution to hide the implementation from the user, and thus make it extensible in the future, is to model the API after Python's stable API [PEP-384]_:
.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct{
      PyTypeObject *typeobj;  /* type of python scalar */
      ...;
      PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);

The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
- The current UFunc type resolution must be adapted to allow better control for user-defined dtypes as well as resolve current inconsistencies.
- The inner-loop used in UFuncs must be expanded to include a return value. Further, error reporting must be improved, and passing in dtype-specific information enabled. This requires the modification of the inner-loop function signature and addition of new hooks called before and after the inner-loop is used.
An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
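The two ideas above, an inner loop that reports errors through a return value and a unit datatype that reuses an existing loop, can be sketched together in pure Python. This is a simplified illustration only: ``add_loop`` and ``unit_add`` are hypothetical helpers, units are plain strings, and real inner loops operate on strided C buffers rather than Python lists.

```python
def add_loop(in1, in2, out):
    """Toy inner loop: fills ``out``; returns 0 on success, -1 on error."""
    if not (len(in1) == len(in2) == len(out)):
        return -1
    for i, (a, b) in enumerate(zip(in1, in2)):
        out[i] = a + b
    return 0


def unit_add(values1, unit1, values2, unit2):
    """Unit-aware addition: handle the units first, then reuse the plain loop."""
    if unit1 != unit2:
        raise TypeError(f"cannot add {unit1!r} and {unit2!r}")
    out = [0.0] * len(values1)
    # The numerical work is delegated to the existing float loop.
    if add_loop(values1, values2, out) != 0:
        raise ValueError("inner loop reported an error")
    return out, unit1


print(unit_add([1.0, 2.0], "m", [0.5, 0.5], "m"))
```

Only the unit handling is specific to the new datatype; the arithmetic itself is unchanged, which is the kind of reuse the changes to the UFunc machinery are meant to enable.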
Discussion
See NEP 40 for a list of previous meetings and discussions.
References
.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
This document has been placed in the public domain.
Acknowledgments
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Hi,
thanks for the feedback!
On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote:
Thanks for publicizing this and all the work that has gone into getting this far.
I'm extremely supportive of the foundational DType meta-type and making dtypes classes. This was the epiphany I had in 2015 that led me to experiment with xnd and later mtypes. I have not had the funding to work on it much since that time directly.
Right, I realize it is an old idea, if you have any references I am missing (I am sure there are many), I am happy to add them.
But, this is the right way to connect the data type system with the rest of Python typing. NumPy's current dtypes are currently analogous to Python 1's user-defined classes. In Python 1 *all* user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.
Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach. Doing this first while trying to minimize other changes will help a lot. I'm very excited by the work being done in this direction.
I can appreciate the desire to be cautious on some of the other issues (like removing numpy array scalars). I do still think that eventually removing numpy array scalars in lieu of instances of dtype objects will be a less complex approach and am not sold generally by the reasons listed in the NEP (though I can appreciate that it's not something to do as part of *this* NEP) as getting there might take more effort than desired at this point.
Well, I do think it is a pretty strong design decision here though. If instances of DType classes are the actual dtypes (and not themselves classes), then it seems strange if scalars are also (direct) instances of the same DType class?
Of course we can and probably will allow `isinstance(scalar, DType)` to work in either case. I do not see a problem with that, although I do not feel like making that decision right now.
If we can agree on still going this direction for now I am happy of course. Nothing stops us from amending or finding new solutions in the future after all.
I used to love the idea, but to be honest, I currently do not see:
1. How to approach it. It would have to be within Python itself, or we would need more shims for Python builtin types?
2. That it is actually helpful for users.
If we were designing a new programming language around array computing principles, I do think that would be the approach I would want to take/consider. But I simply lack the vision of how marrying the idea with the scalar language Python would work out well...
What I would *strongly* recommend right now, however, is to make the new NumPy dtype system a separately-installable module (kept in the NumPy GitHub organization). In that way, people can depend on the NumPy type system without depending on NumPy itself. I think this will become more and more important in the future. It will help the design as you see NumPy as one of many *consumers* of the type system instead of the only one. It would also help projects like arrow and xnd and others in the future that might only want to depend on NumPy's type system but otherwise implement their own computations.
Right, I agree that is the correct long term direction to see the DTypes as distinct from the NumPy array, and maybe I should add that to the NEP. What I am unsure about is the feasibility? If we develop it outside of NumPy, it is harder to:

1. Use the new system without actually exposing it as public API in order to incrementally replace the old with a newer machinery.
2. It may require either exposing subclassing capabilities to NumPy to add shims for legacy DTypes right from the start, or add a bunch of public API which is only meant to be used within NumPy to that project?

I suppose, I am also not sure that having it in NumPy (at least for now) is actually all that bad? For array-likes it is probably not the heaviest dependency (and it could be slimmed down into a core).
Since the intention is to dogfood the API as much as possible and to limit the public API, it should be plausible to rip it out later of course. I am sure that will be more overall effort, but I suppose I feel it is a much more approachable effort.
One thing I would like is for projects such as CuPy to be able to subclass DTypes at some point to tag on the GPU aware things they need. But in some sense the basic DTypes seem to require being tied in with NumPy? They must be associated with the NumPy scalars, and the basic methods defined for all DTypes (also user DTypes) will probably be strided-inner-loops on the CPU.
This might require a little more work to provide an adaptor layer in NumPy itself to use the new system instead of its current dtypes, but I think it will also help ensure that the datatype API is cleaner and more useful to the Python ecosystem as a whole.
While I fully agree with the sentiment, I suppose I am scared that the little more work will end up being too much :(. We have pretty limited resources and the most difficult work will not be writing the DType API itself. It will be wrangling it into NumPy and the associated huge review effort to get it right. Only by actually wrangling it into NumPy, I think we can also get the API fully right to begin with. So, I am scared that moving development outside and trying to add the more global scope at this time will make the NumPy side much more difficult :(. Maybe not even because it is actually much trickier, but again because it seems less tangible/approachable.
So, my main point here is that we have to make this large refactor as approachable as possible, and if that means that at some point someone has to spend a huge, but hopefully straightforward effort, to rip DTypes out of NumPy, I think that might be a worthy trade-off. Unless we can activate significantly larger resources very quickly.
Best,
Sebastian
Thanks,
-Travis
On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg < sebastian@sipsolutions.net> wrote:
=================================================
NEP 41 — First step towards a new Datatype System
=================================================
:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03
.. note::

    This NEP is part of a series of NEPs. NEP 40 first provides information about the previous dtype implementation and its issues. NEP 41 (this document) then provides an overview and generic design choices for the refactor. Further NEPs 42 and 43 go into the technical details of the datatype and universal function related internal and external API changes. In some cases it may be necessary to consult the other NEPs for a full picture of the desired changes and why these changes are necessary.

Abstract
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
.. seealso::

    The user impact section includes examples of what kind of new datatypes will be enabled by the proposed changes in the long run. It may thus help to read these sections out of order.

Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <_ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
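The string example can be made concrete with a small pure-Python sketch of a parametric datatype. ``StringDType``, ``resolve_add`` and ``can_cast_safely`` are hypothetical names used only for illustration; the point is that the output datatype and the casting rule both depend on a parameter (the length) of the specific instances, not just on the datatype class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDType:
    """Toy parametric datatype: a fixed-width string of ``length`` characters."""
    length: int


def resolve_add(a, b):
    """Output dtype of string concatenation: the lengths add up."""
    return StringDType(a.length + b.length)


def can_cast_safely(from_, to):
    """A cast is safe only if no characters can be truncated."""
    return from_.length <= to.length


s4 = StringDType(4)
print(resolve_add(s4, s4))                  # a string of length 8
print(can_cast_safely(s4, StringDType(8)))  # widening is safe
print(can_cast_safely(StringDType(8), s4))  # narrowing may truncate
```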
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
Currently there are many factors which limit the creation of new user-defined datatypes:
- Creating casting rules for parametric user-defined dtypes is
either impossible or so complex that it has never been attempted.
- Type promotion, e.g. the operation deciding that adding float and
integer values should return a float value, is very valuable for numeric datatypes but is limited in scope for user-defined and especially parametric datatypes.
- Much of the logic (e.g. promotion) is written in single functions instead of being split as methods on the datatype itself.
- In the current design datatypes cannot have methods that do not
generalize to other datatypes. For example a unit datatype cannot have a ``.to_si()`` method to easily find the datatype which would represent the same values in SI units.
The large need to solve these issues has driven the scientific community to create work-arounds in multiple projects implementing physical units as an array-like class instead of a datatype, which would generalize better across multiple array-likes (Dask, pandas, etc.). Already, Pandas has made a push into the same direction with its extension arrays [pandas_extension_arrays]_ and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
- Phase I: Restructure and extend the datatype infrastructure (This
NEP 41)
- Phase II: Incrementally define or rework API (Detailed largely in
NEPs 42/43)
- Phase III: Growth of NumPy and Scientific Python Ecosystem
capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see backward compatibility section). However, a transition requiring large code adaption is not anticipated and not within scope.
Specifically, this NEP makes the following design choices, which are discussed in more detail in the detailed description section:
- Each datatype will be an instance of a subclass of ``np.dtype``,
  with most of the datatype-specific logic being implemented as special methods on the class. In the C-API, these correspond to specific slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself. The ``PyArray_ArrFuncs`` which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``), should instead be stored on the class as typically done in Python. In the future these may correspond to Python-side dunder methods. Storage information such as itemsize and byteorder can differ between different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance. This means that in the long run the current low-level access to dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
- The current NumPy scalars will *not* change, they will not be
instances of datatypes. This will also be true for new datatypes, scalars will not be instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to return ``True`` when appropriate).
Detailed technical decisions to follow in NEP 42.
Further, the public API will be designed in a way that is extensible in the future:
- All new C-API functions provided to the user will hide
implementation details as much as possible. The public API should be an identical, but limited, version of the C-API used for the internal NumPy datatypes.
The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:
- To enable all of the desired functionality for new user-defined
datatypes, the UFunc machinery will be changed to replace the current dispatching and type resolution system. The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
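This design principle can be illustrated with a small promotion sketch in pure Python. The protocol shown, a ``common_with`` class method that returns ``NotImplemented`` for unknown partners, is an assumption made for this example (the actual mechanism is left to NEP 42); ``Float64``, ``Int64`` and ``UserFloat128`` are toy classes.

```python
class Float64:
    @classmethod
    def common_with(cls, other):
        # Float64 only knows about itself.
        return cls if other is cls else NotImplemented


class Int64:
    @classmethod
    def common_with(cls, other):
        if other is cls:
            return cls
        if other is Float64:
            return Float64  # Int64 knows that Float64 exists
        return NotImplemented


def common_dtype(a, b):
    """Ask each input in turn; never consult a third, unrelated DType."""
    for first, second in ((a, b), (b, a)):
        result = first.common_with(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common dtype for {a.__name__} and {b.__name__}")


before = common_dtype(Int64, Float64)


class UserFloat128:
    """A user DType defined later that could represent both inputs."""

    @classmethod
    def common_with(cls, other):
        return cls if other in (cls, Int64, Float64) else NotImplemented


after = common_dtype(Int64, Float64)
print(before is after is Float64)         # True: existing results are unchanged
print(common_dtype(Float64, UserFloat128).__name__)
```

Because only the two inputs are consulted, registering ``UserFloat128`` cannot change the result for ``Int64`` and ``Float64``; the new type is only ever returned when it is itself one of the inputs.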
User Impact
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
- bfloat16, used in deep learning
- categorical types
- physical units (such as meters)
- datatypes for tracing/automatic differentiation
- high, fixed precision math
- specialized integer types such as int2, int24
- new, better datetime representations
- extending e.g. integer dtypes to have a sentinel NA value
- geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem by having such new datatypes, as well as consolidating implementation of such datatypes within NumPy to achieve better interoperability.
Examples
^^^^^^^^
The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP, and the choices shown (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.
Simple Numerical Types
""""""""""""""""""""""
Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::
>>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
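With the datatypes NumPy ships today, this relationship can be checked directly. The following is a minimal sketch using only existing NumPy functions; the array names and values are arbitrary choices for illustration.

```python
import numpy as np

arr1 = np.arange(3, dtype=np.int32)
arr2 = np.ones(3, dtype=np.float32)

# Promotion is decided from the dtypes alone, before any loop runs.
promoted = np.result_type(arr1, arr2)

# For a simple ufunc such as add, the output dtype matches the promoted dtype.
result = np.add(arr1, arr2)
matches = result.dtype == promoted

# Both inputs can be cast safely to the promoted dtype.
safe_inputs = all(
    np.can_cast(arr.dtype, promoted, casting="safe") for arr in (arr1, arr2)
)
```

A user-defined numerical type such as ``bfloat16`` would need to take part in the same promotion and casting queries for this lookup to work.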
Fixed, high precision math
""""""""""""""""""""""""""

Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> import mpmath as mp
    >>> print(mp.dps)  # the current (default) precision
    15

NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])

We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as an ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])

The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True

Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.
Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)

And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)

since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
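The promotion and casting rules sketched in this section can be modelled in a few lines of plain Python. This is only a toy illustration: ``ToyMpfDType``, ``toy_common_type``, and ``toy_can_cast`` are hypothetical names, not proposed API, and the value 16 for float64 simply follows the tentative guess above.

```python
from dataclasses import dataclass

# Tentative decimal precision of float64, following the "dps=16" guess above.
FLOAT64_DPS = 16


@dataclass(frozen=True)
class ToyMpfDType:
    """Hypothetical stand-in for a parametric mpf datatype instance."""
    dps: int


def toy_common_type(a, b):
    # The common type keeps the higher of the two precisions.
    return ToyMpfDType(dps=max(a.dps, b.dps))


def toy_can_cast(from_dtype, to_dtype, casting="safe"):
    if casting == "safe":
        # Safe only when no precision is lost.
        return from_dtype.dps <= to_dtype.dps
    if casting == "same_kind":
        # Any mpf-to-mpf cast stays within the same kind.
        return True
    raise ValueError(f"unsupported casting mode: {casting!r}")


low = ToyMpfDType(dps=15)
high = ToyMpfDType(dps=100)
float64_like = ToyMpfDType(dps=FLOAT64_DPS)
```

Here ``toy_can_cast(low, high)`` is true while ``toy_can_cast(high, low)`` is false, mirroring the ``np.can_cast`` examples above.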
Categoricals
""""""""""""

Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)

could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])

The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.
Whether or not casting is defined from one categorical with fewer values to one with strictly more values is something that the Categorical datatype would need to decide. Both options should be available.
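The storage idea can be approximated with today's NumPy by keeping integer codes in a small integer array. This is a rough sketch under the assumption of a fixed category list; ``code_of`` and ``encode`` are hypothetical helpers, not part of any proposed API.

```python
import numpy as np

categories = ("eggs", "spam", "toast")

# Checked once on creation: no two category names may be equal.
if len(set(categories)) != len(categories):
    raise ValueError("category names must be unique")

# Dictionary-like mapping from category name to integer code.
code_of = {name: code for code, name in enumerate(categories)}


def encode(values):
    # One byte per element is enough for up to 256 categories.
    return np.array([code_of[value] for value in values], dtype=np.uint8)


breakfast = encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = encode(["eggs", "eggs", "eggs", "eggs"])

# Equality only needs to compare the integer codes.
equal = breakfast == breakfast2

# Decoding maps the codes back to their names.
decoded = [categories[code] for code in breakfast]
```

A real categorical DType would keep the category table on the dtype instance so that the array itself only stores the codes.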
Unit on the Datatype
""""""""""""""""""""

There are different ways to define Units, depending on how the internal machinery would be organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``; the unit itself is part of the DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.seconds)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))

Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())

The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)

There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)

Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)

In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
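The difference between operation-specific output units and a generic common type can be illustrated with a toy model in which a unit is a mapping from base unit name to exponent. All names below (``multiply_units``, ``divide_units``, ``common_unit``) are hypothetical and only serve to show why the operation must be known before the output unit can be determined.

```python
def multiply_units(u1, u2):
    # Multiplication adds the exponents of each base unit.
    combined = dict(u1)
    for base, exponent in u2.items():
        combined[base] = combined.get(base, 0) + exponent
    # Drop base units that cancel out.
    return {base: exp for base, exp in combined.items() if exp != 0}


def divide_units(u1, u2):
    # Division is multiplication by the inverse unit.
    return multiply_units(u1, {base: -exp for base, exp in u2.items()})


def common_unit(u1, u2):
    # Without knowing the operation, only identical units have a common unit
    # in this simplified model (scale conversions such as m/km are ignored).
    if u1 != u2:
        raise TypeError(f"no common unit for {u1} and {u2}")
    return dict(u1)


METER = {"m": 1}
SECOND = {"s": 1}

speed = divide_units(METER, SECOND)
area = multiply_units(METER, METER)
```

In this model ``common_unit(METER, SECOND)`` raises a ``TypeError``, while multiplication and division each produce a well-defined unit.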
Implementation
Plan to Approach the Full Refactor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To address these issues in NumPy and enable new datatypes, multiple development stages are required:
- Phase I: Restructure and extend the datatype infrastructure (This NEP)

  - Organize Datatypes like normal Python classes [`PR 15508`]_

- Phase II: Incrementally define or rework API

  - Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)

  - Incrementally define all necessary functionality through the new API
    (NEP 42):

    * Defining operations such as ``np.common_type``.
    * Allowing to define casting between datatypes.
    * Add functionality necessary to create a numpy array from Python
      scalars (i.e. ``np.array(...)``).
    * …

  - Restructure how universal functions work (NEP 43), in order to:

    * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be
      extended by user-defined datatypes such as Units.
    * allow efficient lookup for the correct implementation for
      user-defined datatypes.
    * enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine output type.

- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities:

  - Cleanup of legacy behaviour where it is considered buggy or undesirable.
  - Provide a path to define new datatypes from Python.
  - Assist the community in creating types such as Units or Categoricals
  - Allow strings to be used in functions such as ``np.equal`` or ``np.add``.
  - Remove legacy code paths within NumPy to improve long term
    maintainability

This document serves as a basis for phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" datatype Python type object, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining an, initially preliminary, stable public API.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only regards Phase I (shaded area); the rest encompasses Phase II, whose design choices are up for discussion. The figure nevertheless highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The changes required in NumPy are large and touch many areas of the code base, but many of these changes can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before settling the full internal API:

1. New machinery for array coercion, with the goal of enabling user DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots into DType method slots.

At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:
**Python API**:
- ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while right now
  ``type(np.dtype("f8")) is np.dtype``. Code should use ``isinstance`` checks,
  and in very rare cases may have to be adapted to use it.

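A check written with ``isinstance`` behaves the same before and after this change. The helper name ``is_dtype`` below is only an illustrative choice:

```python
import numpy as np


def is_dtype(obj):
    # Works whether type(obj) is np.dtype itself or a subclass of it.
    return isinstance(obj, np.dtype)


float_dtype = np.dtype("f8")
dtype_ok = is_dtype(float_dtype)

# A scalar type such as np.float64 is not a dtype instance.
scalar_type_ok = is_dtype(np.float64)
```

Code that instead compares ``type(obj) is np.dtype`` would stop recognizing dtypes once they become instances of subclasses.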
**C-API**:
- In old versions of NumPy ``PyArray_DescrCheck`` is a macro which uses
  ``type(dtype) is np.dtype``. When compiling against an old NumPy version,
  the macro may have to be replaced with the corresponding
  ``PyObject_IsInstance`` call. (If this is a problem, we could backport
  fixing the macro)
- The UFunc machinery changes will break *limited* parts of the current
  implementation. Replacing e.g. the default ``TypeResolver`` is expected to
  remain supported for a time, although optimized masked inner loop iteration
  (which is not even used *within* NumPy) will no longer be supported.
- All functions currently defined on the dtypes, such as
  ``PyArray_Descr->f->nonzero``, will be defined and accessed differently.
  This means that in the long run low-level access code will have to be
  changed to use the new API. Such changes are expected to be necessary in
  very few projects.

**dtype implementors (C-API)**:
- The array which is currently provided to some functions (such as cast
  functions) will no longer be provided. For example
  ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn`` may instead
  receive a dummy array object with only some fields (mainly the dtype) being
  valid. At least in some code paths, a similar mechanism is already used.
- The ``scalarkind`` slot and registration of scalar casting will be
  removed/ignored without replacement. It currently allows partial value-based
  casting. The ``PyArray_ScalarKind`` function will continue to work for
  builtin types, but will not be used internally and be deprecated.
- Currently user dtypes are defined as instances of ``np.dtype``. The creation
  works by the user providing a prototype instance. NumPy will need to modify
  at least the type during registration. This has no effect for either
  ``rational`` or ``quaternion`` and mutation of the structure seems unlikely
  after registration.

Since there is a fairly large API surface concerning datatypes, further changes or the limitation of certain functions to currently existing datatypes are likely to occur. For example, functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current NumPy datatypes are not full-scale Python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime``, can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).
The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value; this may be achieved by allowing the creation of *abstract* datatypes. An example of an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
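The relationship between a metaclass, DType classes, abstract DTypes, and dtype instances can be mimicked in pure Python. The following is only a toy model: every ``Toy*`` name and the ``abstract`` keyword are hypothetical, and the real ``DTypeMeta`` is proposed to be defined in C.

```python
class ToyDTypeMeta(type):
    """Metaclass whose instances are the (toy) DType classes."""

    def __new__(mcls, name, bases, namespace, abstract=False):
        cls = super().__new__(mcls, name, bases, namespace)
        cls._abstract = abstract
        return cls

    def __init__(cls, name, bases, namespace, abstract=False):
        super().__init__(name, bases, namespace)

    def __call__(cls, *args, **kwargs):
        # Abstract DTypes only organize the hierarchy; they have no instances.
        if cls._abstract:
            raise TypeError(f"cannot instantiate abstract DType {cls.__name__}")
        return super().__call__(*args, **kwargs)


class ToyDType(metaclass=ToyDTypeMeta, abstract=True):
    """Plays the role of np.dtype as the common base class."""


class ToyFloating(ToyDType, abstract=True):
    """Abstract DType, analogous to a datatype version of np.floating."""


class ToyFloat64(ToyFloating):
    """Concrete DType; its instances play the role of actual dtypes."""
    itemsize = 8


float64_dtype = ToyFloat64()
```

With this layout ``isinstance(float64_dtype, ToyFloating)`` holds, so a check against an abstract DType can replace inspecting attributes such as ``dtype.kind``.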
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that, rather than reducing the complexity (i.e. by merging the two distinct type hierarchies), making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely solution to hide the implementation from the user, and thus make it extensible in the future, is to model the API after Python's stable API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    }

    typedef struct{
      PyTypeObject *typeobj;  /* type of python scalar */
      ...;
      PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);

The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
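As a loose Python analogy of this spec-based pattern, a class can be built from a list of ``(slot id, function)`` pairs so that callers never touch the underlying layout and new slot ids can be added later. Everything in this sketch is hypothetical, including the slot identifiers and the ``init_dtype_from_spec`` helper; it does not describe the eventual C-API.

```python
# Hypothetical slot identifiers; new ones can be appended without breaking
# existing callers, since callers never depend on a fixed structure layout.
SLOT_COMMON_DTYPE = 1
SLOT_CAN_CAST = 2

_SLOT_NAMES = {
    SLOT_COMMON_DTYPE: "common_dtype",
    SLOT_CAN_CAST: "can_cast",
}


def init_dtype_from_spec(name, scalar_type, slots):
    """Create a toy DType class from a spec given as (slot id, function) pairs."""
    namespace = {"scalar_type": scalar_type}
    for slot_id, function in slots:
        if slot_id not in _SLOT_NAMES:
            raise ValueError(f"unknown slot id: {slot_id}")
        namespace[_SLOT_NAMES[slot_id]] = staticmethod(function)
    return type(name, (), namespace)


def float_can_cast(other_scalar_type):
    # Toy rule: only Python ints and floats convert to this toy float type.
    return other_scalar_type in (int, float)


ToyFloatDType = init_dtype_from_spec(
    "ToyFloatDType", float, [(SLOT_CAN_CAST, float_can_cast)]
)
```

Slots that are not provided are simply absent here; a real implementation would presumably fill in defaults.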
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
- The current UFunc type resolution must be adapted to allow better control
  for user-defined dtypes as well as resolve current inconsistencies.
- The inner-loop used in UFuncs must be expanded to include a return value.
  Further, error reporting must be improved, and passing in dtype-specific
  information enabled. This requires the modification of the inner-loop
  function signature and addition of new hooks called before and after the
  inner-loop is used.

An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
Discussion
See NEP 40 for a list of previous meetings and discussions.
References
.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
This document has been placed in the public domain.
Acknowledgments
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sun, Mar 22, 2020 at 1:33 PM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi,
thanks for the feedback!
On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote:
Thanks for publicizing this and all the work that has gone into getting this far.
I'm extremely supportive of the foundational DType meta-type and making dtypes classes. This was the epiphany I had in 2015 that led me to experiment with xnd and later mtypes. I have not had the funding to work on it much since that time directly.
Right, I realize it is an old idea, if you have any references I am missing (I am sure there are many), I am happy to add them.
But, this is the right way to connect the data type system with the rest of Python typing. NumPy's current dtypes are currently analogous to Python 1's user-defined classes. In Python 1 *all* user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.
Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach. Doing this first while trying to minimize other changes will help a lot. I'm very excited by the work being done in this direction.
I can appreciate the desire to be cautious on some of the other issues (like removing numpy array scalars). I do still think that eventually removing numpy array scalars in lieu of instances of dtype objects will be a less complex approach and am not sold generally by the reasons listed in the NEP (though I can appreciate that it's not something to do as part of *this* NEP) as getting there might take more effort than desired at this point.
Well, I do think it is a pretty strong design decision here though. If instances of DType classes are the actual dtypes (and not themselves classes), then it seems strange if scalars are also (direct) instances of the same DType class?
<snip>
If we were designing a new programming language around array computing principles, I do think that would be the approach I would want to take/consider. But I simply lack the vision of how marrying the idea with the scalar language Python would work out well...
I agree that it makes sense to wait and do as the NEP says --- not require that to begin with. The persuasive argument to me is that as far as NumPy is concerned it is too many changes at once, and all the repercussions are not fully understood. So, it's better to wait.
But, if another container system were to take the dtypes and make scalars instances of those types, it could actually work.
At any rate, I agree with the decision and current plan to *not* make that change too.
What I would *strongly* recommend right now, however, is to make the new NumPy dtype system a separately-installable module (kept in the NumPy GitHub organization). In that way, people can depend on the NumPy type system without depending on NumPy itself. I think this will become more and more important in the future. It will help the design as you see NumPy as one of many *consumers* of the type system instead of the only one. It would also help projects like arrow and xnd and others in the future that might only want to depend on NumPy's type system but otherwise implement their own computations.
Right, I agree that is the correct long term direction to see the DTypes as distinct from the NumPy array, and maybe I should add that to the NEP. What I am unsure about is the feasibility? If we develop it outside of NumPy, it is harder to:

1. Use the new system without actually exposing it as public API in order to incrementally replace the old with a newer machinery.
2. It may require either exposing subclassing capabilities to NumPy to add shims for legacy DTypes right from the start, or add a bunch of public API which is only meant to be used within NumPy to that project?

I suppose, I am also not sure that having it in NumPy (at least for now) is actually all that bad? For array-likes it is probably not the heaviest dependency (and it could be slimmed down into a core).
Since the intention is to dog-feed the API as much as possible and to limit the public API, it should be plausible to rip it out later of course. I am sure that will be more overall effort, but I suppose I feel it is much more approachable effort.
That makes sense. I'd love to see in the NEP some discussion of the possibility down the road of making a distinct module. It is more work initially to do that. I can see your point about not being sure about the public APIs at this point, but on the other hand, there is nothing like having two consumers to force that issue and make the overall design better.
For example, if pyarrow, pytorch, and NumPy were to collaborate on this "dtype" module, I think the result would be stronger for both. But, it would require more funding and this NEP could allow for the possibility but not propose it directly.
One thing I would like is for projects such as CuPy to be able to subclass DTypes at some point to tag on the GPU aware things they need. But in some sense the basic DTypes seem to require being tied in with NumPy? They must be associated with the NumPy scalars, and the basic methods defined for all DTypes (also user DTypes) will probably be strided-inner-loops on the CPU.
This might require a little more work to provide an adaptor layer in NumPy itself to use the new system instead of its current dtypes, but I think it will also help ensure that the datatype API is cleaner and more useful to the Python ecosystem as a whole.
While I fully agree with the sentiment, I suppose I am scared that the little more work will end up being too much :(. We have pretty limited resources and the most difficult work will not be writing the DType API itself. It will be wrangling it into NumPy and the associated huge review effort to get it right. Only by actually wrangling it into NumPy, I think we can also get the API fully right to begin with. So, I am scared that moving development outside and trying to add the more global scope at this time will make the NumPy side much more difficult :(. Maybe not even because it is actually much trickier, but again because it seems less tangible/approachable.
So, my main point here is that we have to make this large refactor as approachable as possible, and if that means that at some point someone has to spend a huge, but hopefully straight forward effort, to rip DTypes out of NumPy, I think that might be a worthy trade-off. Unless we can activate significantly larger resources very quickly.
That is a very reasonable response. I really appreciate the hard work you and others have put into this and am extremely grateful for the funding that has been provided thus far. I am interested in trying to get more funding and a common data-type API is one of those things I'm interested in helping see emerge.
What you are doing here in NumPy is the closest thing to the "right" approach I have seen that has a chance of seeing adoption.
Retrofitting NumPy is a lot of work on its own, and it is entirely possible that the kinds of attributes, methods and functions that are needed for NumPy means that an independent typing module is "easy" but not really useful for NumPy. It could be a straight-forward thing later to make what you produce either a formal subtype or simply an abstract duck-type of some other data-type system.
As you make progress in this direction in NumPy, I'll keep following your work and maybe we can get more funding to work on something in this direction.
Thanks so much,
-Travis
Best,
Sebastian
Thanks,
-Travis
On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg < sebastian@sipsolutions.net> wrote:
=================================================
NEP 41 — First step towards a new Datatype System
=================================================

:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03

.. note::

    This NEP is part of a series of NEPs encompassing first information
    about the previous dtype implementation and issues with it in NEP 40.
    NEP 41 (this document) then provides an overview and generic design
    choices for the refactor. Further NEPs 42 and 43 go into the technical
    details of the datatype and universal function related internal and
    external API changes. In some cases it may be necessary to consult the
    other NEPs for a full picture of the desired changes and why these
    changes are necessary.

Abstract
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
.. seealso::

    The user impact section includes examples of what kind of new datatypes
    will be enabled by the proposed changes in the long run. It may thus
    help to read these sections out of order.

Motivation
^^^^^^^^^^

One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <_ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
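The string case can be observed with current NumPy. This small sketch uses the existing ``np.char.add`` function; the array contents are arbitrary.

```python
import numpy as np

left = np.array(["abcd", "wxyz"])
right = np.array(["efgh", "1234"])

# Each input holds strings of length 4 (dtype "<U4" on little-endian systems).
input_dtype = left.dtype

# Concatenation needs room for 8 characters, so the output dtype differs
# from both input dtypes.
joined = np.char.add(left, right)
output_dtype = joined.dtype
```

The output length depends on the parameters of both inputs, which is the kind of additional step a parametric datatype needs when determining the output type.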
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
Currently there are many factors which limit the creation of new user-defined datatypes:
- Creating casting rules for parametric user-defined dtypes is
either impossible or so complex that it has never been attempted.
- Type promotion, e.g. the operation deciding that adding float and
integer values should return a float value, is very valuable for numeric datatypes but is limited in scope for user-defined and especially parametric datatypes.
- Much of the logic (e.g. promotion) is written in single functions instead of being split as methods on the datatype itself.
- In the current design datatypes cannot have methods that do not
generalize to other datatypes. For example a unit datatype cannot have a ``.to_si()`` method to easily find the datatype which would represent the same values in SI units.
The large need to solve these issues has driven the scientific community to create workarounds in multiple projects, implementing physical units as an array-like class instead of a datatype, even though a datatype would generalize better across multiple array-likes (Dask, pandas, etc.). Pandas has already made a push in the same direction with its extension arrays [pandas_extension_arrays]_, and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.
Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
- Phase I: Restructure and extend the datatype infrastructure (This
NEP 41)
- Phase II: Incrementally define or rework API (Detailed largely in
NEPs 42/43)
- Phase III: Growth of NumPy and Scientific Python Ecosystem
capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see the backward compatibility section). However, a transition requiring large code adaptation is not anticipated and not within scope.
Specifically, this NEP makes the following design choices, which are discussed in more detail in the Detailed Description section:
- Each datatype will be an instance of a subclass of ``np.dtype``,
with most of the datatype-specific logic being implemented as special methods on the class. In the C-API, these correspond to specific slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself. The ``PyArray_ArrFuncs``, which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``), should instead be stored on the class, as is typically done in Python. In the future these may correspond to Python-side dunder methods. Storage information such as itemsize and byteorder can differ between different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance. This means that in the long run the current low-level access to dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
- The current NumPy scalars will *not* change; they will not be
instances of datatypes. This will also be true for new datatypes: scalars will not be instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to return ``True`` when appropriate).
Detailed technical decisions to follow in NEP 42.
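The split described in the first design choice, with datatype-specific logic on the class and storage information on the instance, can be illustrated with a small pure-Python model. This is only a sketch: ``DType``, ``StringDType`` and the ``nonzero`` method are hypothetical stand-ins chosen for illustration, not NumPy API.

```python
class DType:
    """Hypothetical stand-in for ``np.dtype``; not the NumPy implementation."""

    def __init__(self, itemsize, byteorder="="):
        # Storage information lives on the instance and may differ
        # between instances of the same class (e.g. "S3" vs. "S8").
        self.itemsize = itemsize
        self.byteorder = byteorder


class StringDType(DType):
    """Hypothetical bytes-string datatype class."""

    # Datatype-specific logic lives on the class, playing the role
    # that the ``PyArray_ArrFuncs`` slots play on the instance today.
    def nonzero(self, item: bytes) -> bool:
        return item.rstrip(b"\x00") != b""


s3 = StringDType(itemsize=3)
s8 = StringDType(itemsize=8)

# The isinstance check keeps working, although the exact type is a subclass.
print(isinstance(s3, DType), type(s3) is DType)
# Both instances share one class, while their storage details differ.
print(type(s3) is type(s8), s3.itemsize, s8.itemsize)
```

Code that checks ``type(f) is np.dtype`` corresponds to the ``type(s3) is DType`` test here, which is why ``isinstance`` is the robust spelling.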
Further, the public API will be designed in a way that is extensible in the future:
- All new C-API functions provided to the user will hide
implementation details as much as possible. The public API should be an identical, but limited, version of the C-API used for the internal NumPy datatypes.
The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:
- To enable all of the desired functionality for new user-defined
datatypes, the UFunc machinery will be changed to replace the current dispatching and type resolution system. The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
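One way to satisfy this principle is to let promotion consult only the two operands. The following is a minimal sketch under that assumption; the ``__common_dtype__`` hook, the ``common_dtype`` function and the three DType classes are hypothetical names used for illustration and are not defined by this NEP.

```python
class DType:
    """Hypothetical base class; each DType decides promotion itself."""

    @classmethod
    def __common_dtype__(cls, other):
        # By default a DType only promotes with itself.
        return cls if other is cls else NotImplemented


class Int64(DType):
    pass


class Float64(DType):
    @classmethod
    def __common_dtype__(cls, other):
        # Float64 knows about Int64 and absorbs it.
        if other in (cls, Int64):
            return cls
        return NotImplemented


class Decimal128(DType):
    """A later, user-defined DType that the builtin ones do not know."""

    @classmethod
    def __common_dtype__(cls, other):
        if other in (cls, Int64):
            return cls
        return NotImplemented


def common_dtype(a, b):
    # Ask only the two operands. No global registry is consulted, so a
    # third DType can never appear unless ``a`` or ``b`` returns it.
    for first, second in ((a, b), (b, a)):
        result = first.__common_dtype__(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common DType for {a.__name__} and {b.__name__}")


print(common_dtype(Int64, Float64).__name__)
print(common_dtype(Int64, Decimal128).__name__)
```

Defining ``Decimal128`` does not alter the result for ``Int64`` and ``Float64``, and ``common_dtype(Float64, Decimal128)`` raises because neither side knows the other.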
User Impact
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
- bfloat16, used in deep learning
- categorical types
- physical units (such as meters)
- datatypes for tracing/automatic differentiation
- high, fixed precision math
- specialized integer types such as int2, int24
- new, better datetime representations
- extending e.g. integer dtypes to have a sentinel NA value
- geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem through such new datatypes and the consolidation of their implementation within NumPy to achieve better interoperability.
Examples
^^^^^^^^
The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP, and the choices shown (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.
Simple Numerical Types
""""""""""""""""""""""
Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
Fixed, high precision math
""""""""""""""""""""""""""
Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> import mpmath as mp
    >>> print(mp.mp.dps)  # the current (default) precision
    15
NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as an ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True
Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False
since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.
Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)
And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
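The casting and promotion rules sketched in this section can be modelled in a few lines of plain Python. This is a toy model under stated assumptions: ``MpfDType``, ``can_cast_to`` and ``common_dtype`` are hypothetical names, and only casts between two ``mpf`` datatypes are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical parametric datatype; ``dps`` counts decimal digits."""

    dps: int = 15

    def can_cast_to(self, other: "MpfDType", casting: str = "safe") -> bool:
        if casting == "safe":
            # Safe only if no precision is lost.
            return other.dps >= self.dps
        if casting == "same_kind":
            # Any mpf-to-mpf cast stays within the same kind.
            return True
        raise ValueError(f"unsupported casting rule: {casting!r}")

    def common_dtype(self, other: "MpfDType") -> "MpfDType":
        # Promotion keeps the higher precision of the two operands.
        return MpfDType(dps=max(self.dps, other.dps))


low, high = MpfDType(dps=15), MpfDType(dps=100)
print(low.can_cast_to(high), high.can_cast_to(low))
print(low.common_dtype(high))
```

The parameter ``dps`` lives on the instance, so the result of promotion is a new instance of the same class rather than a different class.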
Categoricals
""""""""""""
Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])
The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.
Whether or not casting is defined from one categorical with fewer values to one with strictly more values is something that the Categorical datatype would need to decide. Both options should be available.
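The dictionary-like behaviour described above can be sketched in plain Python. The ``Categorical`` class, its ``encode`` and ``decode`` methods, and the ``equal`` helper are hypothetical and only illustrate why equality on integer codes is cheap.

```python
class Categorical:
    """Hypothetical fixed-category datatype storing integer codes."""

    def __init__(self, categories):
        # Checked on dtype creation: no two category names may be equal.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        # Dictionary-like lookup from label to integer code.
        self._codes = {name: code for code, name in enumerate(self.categories)}

    def encode(self, values):
        # Raises KeyError for values outside the fixed categories.
        return [self._codes[value] for value in values]

    def decode(self, codes):
        return [self.categories[code] for code in codes]


def equal(codes_a, codes_b):
    # With a shared dtype, equality only compares the small integer codes.
    return [a == b for a, b in zip(codes_a, codes_b)]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])
print(equal(breakfast, breakfast2))
```

A real implementation would store the codes in a compact integer array; the list of Python integers here only stands in for that storage.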
Unit on the Datatype
""""""""""""""""""""
There are different ways to define units, depending on how the internal machinery is organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``; the unit itself is part of the DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.s)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())
The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)
There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)
Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
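The difference between operation-specific resolution and a context-free ``result_type`` can be made concrete with a toy unit model. Everything here is an assumption for illustration: ``Unit`` tracks only length and time exponents, and ``resolve_multiply``, ``resolve_divide`` and ``result_type`` are hypothetical helpers, not NumPy functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical unit dtype: exponents of the length and time dimensions."""

    length: int = 0
    time: int = 0


METERS = Unit(length=1)
SECONDS = Unit(time=1)


def resolve_multiply(a: Unit, b: Unit) -> Unit:
    # Multiplication adds the exponents of each base dimension.
    return Unit(a.length + b.length, a.time + b.time)


def resolve_divide(a: Unit, b: Unit) -> Unit:
    # Division subtracts them: a distance divided by a time is a speed.
    return Unit(a.length - b.length, a.time - b.time)


def result_type(a: Unit, b: Unit) -> Unit:
    # Without knowing the operation, only identical units have an answer.
    if a != b:
        raise TypeError(f"no common unit for {a} and {b}")
    return a


print(resolve_divide(METERS, SECONDS))
print(resolve_multiply(METERS, SECONDS))
```

Each operation needs its own resolution step, which is why the output datatype has to be decided by the specific loop rather than by a single promotion function.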
Implementation
Plan to Approach the Full Refactor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To address these issues in NumPy and enable new datatypes, multiple development stages are required:
- Phase I: Restructure and extend the datatype infrastructure (This NEP)

  - Organize Datatypes like normal Python classes [`PR 15508`]_

- Phase II: Incrementally define or rework API

  - Create a new and easily extensible API for defining new datatypes
    and related functionality. (NEP 42)

  - Incrementally define all necessary functionality through the new API
    (NEP 42):

    - Defining operations such as ``np.common_type``.
    - Allowing to define casting between datatypes.
    - Add functionality necessary to create a numpy array from Python
      scalars (i.e. ``np.array(...)``).
    - …

  - Restructure how universal functions work (NEP 43), in order to:

    - make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be
      extended by user-defined datatypes such as Units.
    - allow efficient lookup for the correct implementation for
      user-defined datatypes.
    - enable reuse of existing code. Units should be able to use the
      normal math loops and add additional logic to determine output type.

- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities:

  - Cleanup of legacy behaviour where it is considered buggy or
    undesirable.
  - Provide a path to define new datatypes from Python.
  - Assist the community in creating types such as Units or Categoricals
  - Allow strings to be used in functions such as ``np.equal`` or
    ``np.add``.
  - Remove legacy code paths within NumPy to improve long term
    maintainability
This document serves as a basis for phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" datatype Python type object, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining an, initially preliminary, stable public API.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only concerns Phase I (shaded area); the rest encompasses Phase II, whose design choices are up for discussion. The figure nevertheless highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The changes required in NumPy are large and touch many areas of the code base, but many of them can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
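The relationship between the metaclass, the DType classes and their instances can be mirrored in pure Python. This is a rough analogue only: the real ``PyArray_DTypeMeta`` is defined in C, and the ``_slots`` table, the ``dtype`` stand-in and the ``Float64`` class below are hypothetical.

```python
class DTypeMeta(type):
    """Hypothetical Python analogue of the C-level ``PyArray_DTypeMeta``."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Per-class storage that plain ``type`` does not offer: a table of
        # method slots, kept on the DType class rather than on its instances.
        cls._slots = {
            key: value
            for key, value in namespace.items()
            if callable(value) and not key.startswith("__")
        }


class dtype(metaclass=DTypeMeta):
    """Stand-in for ``np.dtype``; its subclasses are the DType classes."""


class Float64(dtype):
    itemsize = 8

    def nonzero(self, value):
        return value != 0.0


f8 = Float64()
# The class is an instance of the metaclass; the dtype is an instance of the class.
print(type(Float64) is DTypeMeta, isinstance(f8, dtype))
print(sorted(Float64._slots))
```

In C, the metaclass is what makes room for such extra slots on the type object itself, which a plain ``PyTypeObject`` cannot hold.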
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before settling the full internal API:
1. New machinery for array coercion, with the goal of enabling user
   DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
   into DType method slots.
At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:
**Python API**:
- ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
while right now ``type(np.dtype("f8")) is np.dtype``. Code should use ``isinstance`` checks, and in very rare cases may have to be adapted to use it.
**C-API**:
- In old versions of NumPy ``PyArray_DescrCheck`` is a macro
which uses ``type(dtype) is np.dtype``. When compiling against an old NumPy version, the macro may have to be replaced with the corresponding ``PyObject_IsInstance`` call. (If this is a problem, we could backport fixing the macro)
- The UFunc machinery changes will break *limited* parts of the
current implementation. Replacing e.g. the default ``TypeResolver`` is expected to remain supported for a time, although optimized masked inner loop iteration (which is not even used *within* NumPy) will no longer be supported.
- All functions currently defined on the dtypes, such as ``PyArray_Descr->f->nonzero``, will be defined and accessed
differently. This means that in the long run low-level access code will have to be changed to use the new API. Such changes are expected to be necessary in very few projects.
**dtype implementors (C-API)**:
- The array which is currently provided to some functions (such
as cast functions), will no longer be provided. For example ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn``, may instead receive a dummy array object with only some fields (mainly the dtype), being valid. At least in some code paths, a similar mechanism is already used.
- The ``scalarkind`` slot and registration of scalar casting will
be removed/ignored without replacement. It currently allows partial value-based casting. The ``PyArray_ScalarKind`` function will continue to work for builtin types, but will not be used internally and will be deprecated.
- Currently user dtypes are defined as instances of
``np.dtype``. The creation works by the user providing a prototype instance. NumPy will need to modify at least the type during registration. This has no effect for either ``rational`` or ``quaternion`` and mutation of the structure seems unlikely after registration.
Since there is a fairly large API surface concerning datatypes, further changes or the limitation of certain functions to currently existing datatypes are likely to occur. For example, functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current NumPy datatypes are not full-scale Python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime``, can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).
The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value, and this may be achieved by allowing the creation of *abstract* datatypes. An example for an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
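Since the paragraph above compares abstract datatypes to Python's abstract base classes, the idea can be sketched with the standard ``abc`` module. The class names (``DType``, ``Floating``, ``Float32``, ``Float64``, ``Int64``) and the ``is_floating`` helper are hypothetical and not part of this NEP.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Hypothetical root of the DType hierarchy."""

    @property
    @abstractmethod
    def itemsize(self) -> int:
        """Number of bytes per element."""


class Floating(DType):
    """Abstract DType: the datatype equivalent of ``np.floating``."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


class Int64(DType):
    itemsize = 8


def is_floating(dt: DType) -> bool:
    # One isinstance check replaces inspecting ``kind`` or ``char`` codes.
    return isinstance(dt, Floating)


print(is_floating(Float64()), is_floating(Int64()))
```

``Floating`` itself cannot be instantiated because it leaves ``itemsize`` abstract, so it only serves for grouping and type checks.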
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that, rather than reducing the complexity (i.e. by merging the two distinct type hierarchies), making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely solution to hide the implementation from the user, and thus make it extensible in the future, is to model the API after Python's stable API [PEP-384]_:
.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct{
        PyTypeObject *typeobj;  /* type of python scalar */
        ...;
        PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
- The current UFunc type resolution must be adapted to allow better
control for user-defined dtypes as well as resolve current inconsistencies.
- The inner-loop used in UFuncs must be expanded to include a
return value. Further, error reporting must be improved, and passing in dtype-specific information enabled. This requires the modification of the inner-loop function signature and addition of new hooks called before and after the inner-loop is used.
An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
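The two goals above, an inner loop that reports errors through a return value and a units datatype that falls back to an existing loop, can be sketched in plain Python. The function names, the string-based unit representation and the ``0``/``-1`` status convention are assumptions made for this illustration; the actual signatures are left to NEP 43.

```python
def float_multiply_loop(in1, in2, out):
    """Existing numeric inner loop; returns 0 on success, -1 on error."""
    if not (len(in1) == len(in2) == len(out)):
        return -1
    for i, (a, b) in enumerate(zip(in1, in2)):
        out[i] = a * b
    return 0


def unit_multiply(values1, unit1, values2, unit2):
    """Hypothetical unit-aware multiply built on the plain float loop."""
    # Step 1 (before the loop): resolve the output unit from the inputs.
    out_unit = f"{unit1}*{unit2}"
    # Step 2: reuse the existing loop for the actual arithmetic.
    out = [0.0] * len(values1)
    status = float_multiply_loop(values1, values2, out)
    if status != 0:
        # The return value lets the caller report the failure.
        raise ValueError("inner loop failed: mismatched input lengths")
    return out, out_unit


print(unit_multiply([1.0, 2.0], "m", [3.0, 4.0], "s"))
```

The unit handling never touches the element-wise arithmetic, which is the property that makes reuse of the existing math loops possible.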
Discussion
See NEP 40 for a list of previous meetings and discussions.
References
.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. [xarray_dtype_issue] https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
This document has been placed in the public domain.
Acknowledgments
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sun, Mar 22, 2020 at 7:33 PM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi,
thanks for the feedback!
On Sat, 2020-03-21 at 15:58 -0500, Travis Oliphant wrote:
Thanks for publicizing this and all the work that has gone into getting this far.
I'm extremely supportive of the foundational DType meta-type and making dtypes classes. This was the epiphany I had in 2015 that led me to experiment with xnd and later mtypes. I have not had the funding to work on it much since that time directly.
Right, I realize it is an old idea, if you have any references I am missing (I am sure there are many), I am happy to add them.
But, this is the right way to connect the data type system with the rest of Python typing. NumPy's current dtypes are currently analogous to Python 1's user-defined classes. In Python 1 *all* user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.
Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach. Doing this first while trying to minimize other changes will help a lot. I'm very excited by the work being done in this direction.
I can appreciate the desire to be cautious on some of the other issues (like removing numpy array scalars). I do still think that eventually removing numpy array scalars in lieu of instances of dtype objects will be a less complex approach, and I am not sold generally by the reasons listed in the NEP (though I can appreciate that it's not something to do as part of *this* NEP), as getting there might take more effort than desired at this point.
Well, I do think it is a pretty strong design decision here though. If instances of DType classes are the actual dtypes (and not themselves classes), then it seems strange if scalars are also (direct) instances of the same DType class?
Of course we can and probably will allow `isinstance(scalar, DType)` to work in either case. I do not see a problem with that, although I do not feel like making that decision right now.
If we can agree on still going this direction for now I am happy of course. Nothing stops us from amending or finding new solutions in the future after all.
I used to love the idea, but to be honest, I currently do not see:
1. How to approach it. It would have to be within Python itself, or we
   would need more shims for Python builtin types?
2. That it is actually helpful for users.
If we were designing a new programming language around array computing principles, I do think that would be the approach I would want to take/consider. But I simply lack the vision of how marrying the idea with the scalar language Python would work out well...
I have had a glance at what you are after, and it seems challenging indeed. IMO, trying to cope with everybody's need regarding data types is extremely costly (or more simply, not even possible). I think that a better approach would be to decouple the storage part of a container from its data type system. In the storage there should go things needed to cope with the data retrieval, like the itemsize, the shape, or even other sub-shapes for chunked datasets. Then, in the data type layer, one should be able to add meaning to the raw data: is that an integer? speed? temperature? a compound type?
Indeed the data storage layer should be able to provide a way to store the data type representation so that a container can be serialized and deserialized correctly. But the important thing here is that this decoupling between storage and types allows for different data type systems, so that anyone can come with a specific type system depending on her needs. One can envision here even a basic data type system (e.g. a version of what's now supported in NumPy) that can be extended with other layers, depending on the needs, so that every community can interchange data at a basic level at least.
As an example, this is the basic spirit behind the under-construction Caterva array container (https://github.com/Blosc/Caterva). Blosc2 (https://github.com/Blosc/C-Blosc2) will provide the low-level storage layer, with no information about dimensionality. Caterva will build the multidimensional layer, but with no information about the types at all. On top of this scaffold, third-party layers will be free to build their own data types, specific to every domain (the concept is pictured in slide 18 of this presentation: https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). Nothing prevents adding more layers, or even a two-layer (and preferably no more than two-layer) data type system: one for simple data types (e.g. NumPy ones) and one meant to be more domain-specific.
Right now, I think that the limitation that keeps the NumPy community thinking in terms of blending storage and types in the same layer is that NumPy provides a computational engine too, and for doing computations, one needs to provide both storage (including dimensionality info) and type information indeed. By using the multi-layer approach, there should be a computational layer that is laid out on top of the storage and the type layers, and hence, specific for leveraging them. Sure, that is a big departure from what we are used to, but as long as one can keep the architecture of the different layers simple, one could see interesting results before long.
Just my 2 cents, Francesc
What I would *strongly* recommend right now, however, is to make the new NumPy dtype system a separately-installable module (kept in the NumPy GitHub organization). In that way, people can depend on the NumPy type system without depending on NumPy itself. I think this will become more and more important in the future. It will help the design as you see NumPy as one of many *consumers* of the type system instead of the only one. It would also help projects like arrow and xnd and others in the future that might only want to depend on NumPy's type system but otherwise implement their own computations.
Right, I agree that is the correct long term direction to see the DTypes as distinct from the NumPy array, and maybe I should add that to the NEP. What I am unsure about is the feasibility? If we develop it outside of NumPy, it is harder to:
1. Use the new system without actually exposing it as public API in order to incrementally replace the old with a newer machinery.
2. It may require either exposing subclassing capabilities to NumPy to add shims for legacy DTypes right from the start, or add a bunch of public API which is only meant to be used within NumPy to that project?
I suppose, I am also not sure that having it in NumPy (at least for now) is actually all that bad? For array-likes it is probably not the heaviest dependency (and it could be slimmed down into a core).
Since the intention is to dogfood the API as much as possible and to limit the public API, it should be plausible to rip it out later of course. I am sure that will be more overall effort, but I suppose I feel it is a much more approachable effort.
One thing I would like is for projects such as CuPy to be able to subclass DTypes at some point to tag on the GPU aware things they need. But in some sense the basic DTypes seem to require being tied in with NumPy? They must be associated with the NumPy scalars, and the basic methods defined for all DTypes (also user DTypes) will probably be strided-inner-loops on the CPU.
This might require a little more work to provide an adaptor layer in NumPy itself to use the new system instead of its current dtypes, but I think it will also help ensure that the datatype API is cleaner and more useful to the Python ecosystem as a whole.
While I fully agree with the sentiment, I suppose I am scared that the little more work will end up being too much :(. We have pretty limited resources and the most difficult work will not be writing the DType API itself. It will be wrangling it into NumPy and the associated huge review effort to get it right. Only by actually wrangling it into NumPy, I think we can also get the API fully right to begin with. So, I am scared that moving development outside and trying to add the more global scope at this time will make the NumPy side much more difficult :(. Maybe not even because it is actually much trickier, but again because it seems less tangible/approachable.
So, my main point here is that we have to make this large refactor as approachable as possible, and if that means that at some point someone has to spend a huge, but hopefully straightforward effort, to rip DTypes out of NumPy, I think that might be a worthy trade-off. Unless we can activate significantly larger resources very quickly.
Best,
Sebastian
Thanks,
-Travis
On Wed, Mar 11, 2020 at 7:08 PM Sebastian Berg < sebastian@sipsolutions.net> wrote:
================================================= NEP 41 — First step towards a new Datatype System =================================================
:title: Improved Datatype Support
:Author: Sebastian Berg
:Author: Stéfan van der Walt
:Author: Matti Picus
:Status: Draft
:Type: Standard Track
:Created: 2020-02-03
.. note::

    This NEP is part of a series of NEPs encompassing first information about the previous dtype implementation and issues with it in NEP 40. NEP 41 (this document) then provides an overview and generic design choices for the refactor. Further NEPs 42 and 43 go into the technical details of the datatype and universal function related internal and external API changes. In some cases it may be necessary to consult the other NEPs for a full picture of the desired changes and why these changes are necessary.
Abstract
`Datatypes <data-type-objects-dtype>` in NumPy describe how to interpret each element in arrays. NumPy provides ``int``, ``float``, and ``complex`` numerical types, as well as string, datetime, and structured datatype capabilities. The growing Python community, however, has need for more diverse datatypes. Examples are datatypes with unit information attached (such as meters) or categorical datatypes (fixed set of possible values). However, the current NumPy datatype API is too limited to allow the creation of these.
This NEP is the first step to enable such growth; it will lead to a simpler development path for new datatypes. In the long run the new datatype system will also support the creation of datatypes directly from Python rather than C. Refactoring the datatype API will improve maintainability and facilitate development of both user-defined external datatypes, as well as new features for existing datatypes internal to NumPy.
Motivation and Scope
.. seealso::

    The user impact section includes examples of what kind of new datatypes will be enabled by the proposed changes in the long run. It may thus help to read these sections out of order.

Motivation
^^^^^^^^^^
One of the main issues with the current API is the definition of typical functions such as addition and multiplication for parametric datatypes (see also NEP 40) which require additional steps to determine the output type. For example when adding two strings of length 4, the result is a string of length 8, which is different from the input. Similarly, a datatype which embeds a physical unit must calculate the new unit information: dividing a distance by a time results in a speed. A related difficulty is that the :ref:`current casting rules <_ufuncs.casting>` -- the conversion between different datatypes -- cannot describe casting for such parametric datatypes implemented outside of NumPy.
This additional functionality for supporting parametric datatypes introduces increased complexity within NumPy itself, and furthermore is not available to external user-defined datatypes. In general the concerns of different datatypes are not well encapsulated. This burden is exacerbated by the exposure of internal C structures, limiting the addition of new fields (for example to support new sorting methods [new_sort]_).
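As a minimal pure-Python sketch of why parametric datatypes need an extra step, consider the string example above. The ``StringDType`` class and its ``add_result`` method are hypothetical names used only for illustration; they are not part of NumPy or of this NEP:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDType:
    """Hypothetical parametric datatype: a fixed-length string."""

    length: int

    def add_result(self, other: "StringDType") -> "StringDType":
        # Concatenation: the output length is the sum of the input lengths,
        # so the result datatype differs from both inputs.
        return StringDType(self.length + other.length)


s4 = StringDType(4)
result = s4.add_result(s4)
print(result)  # StringDType(length=8)
```

The point of the sketch is only that the output datatype has to be computed from the input instances, which a fixed table of type numbers cannot express.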
Currently there are many factors which limit the creation of new user-defined datatypes:
- Creating casting rules for parametric user-defined dtypes is either impossible or so complex that it has never been attempted.
- Type promotion, e.g. the operation deciding that adding float and integer values should return a float value, is very valuable for numeric datatypes but is limited in scope for user-defined and especially parametric datatypes.
- Much of the logic (e.g. promotion) is written in single functions instead of being split as methods on the datatype itself.
- In the current design datatypes cannot have methods that do not generalize to other datatypes. For example a unit datatype cannot have a ``.to_si()`` method to easily find the datatype which would represent the same values in SI units.
The large need to solve these issues has driven the scientific community to create workarounds in multiple projects implementing physical units as an array-like class instead of a datatype, which would generalize better across multiple array-likes (Dask, pandas, etc.). Already, Pandas has made a push in the same direction with its extension arrays [pandas_extension_arrays]_ and undoubtedly the community would be best served if such new features could be common between NumPy, Pandas, and other projects.

Scope
^^^^^
The proposed refactoring of the datatype system is a large undertaking and thus is proposed to be split into various phases, roughly:
- Phase I: Restructure and extend the datatype infrastructure (This NEP 41)
- Phase II: Incrementally define or rework API (Detailed largely in NEPs 42/43)
- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities.
For a more detailed accounting of the various phases, see "Plan to Approach the Full Refactor" in the Implementation section below. This NEP proposes to move ahead with the necessary creation of new dtype subclasses (Phase I), and start working on implementing current functionality. Within the context of this NEP all development will be fully private API or use preliminary underscored names which must be changed in the future. Most of the internal and public API choices are part of a second Phase and will be discussed in more detail in the following NEPs 42 and 43. The initial implementation of this NEP will have little or no effect on users, but provides the necessary ground work for incrementally addressing the full rework.
The implementation of this NEP and the following, implied large rework of how datatypes are defined in NumPy is expected to create small incompatibilities (see backward compatibility section). However, a transition requiring large code adaption is not anticipated and not within scope.
Specifically, this NEP makes the following design choices, which are discussed in more detail in the Detailed Description section:

- Each datatype will be an instance of a subclass of ``np.dtype``, with most of the datatype-specific logic being implemented as special methods on the class. In the C-API, these correspond to specific slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)`` will remain true, but ``type(f)`` will be a subclass of ``np.dtype`` rather than just ``np.dtype`` itself. The ``PyArray_ArrFuncs`` which are currently stored as a pointer on the instance (as ``PyArray_Descr->f``), should instead be stored on the class as typically done in Python. In the future these may correspond to Python-side dunder methods. Storage information such as itemsize and byteorder can differ between different dtype instances (e.g. "S3" vs. "S8") and will remain part of the instance. This means that in the long run the current low-level access to dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
- The current NumPy scalars will *not* change; they will not be instances of datatypes. This will also be true for new datatypes: scalars will not be instances of a dtype (although ``isinstance(scalar, dtype)`` may be made to return ``True`` when appropriate).

Detailed technical decisions to follow in NEP 42.
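The split between class-level logic and instance-level storage can be sketched in pure Python as follows. ``DType`` and ``BytesDType`` are hypothetical stand-ins for ``np.dtype`` and the class shared by "S3" and "S8"; the ``nonzero`` method only imitates the role of a slot and is not the proposed API:

```python
class DType:
    """Pure-Python stand-in for ``np.dtype`` (hypothetical sketch)."""

    def __init__(self, itemsize, byteorder="|"):
        # Storage information differs per instance, so it stays on the instance.
        self.itemsize = itemsize
        self.byteorder = byteorder


class BytesDType(DType):
    """Stand-in for the class shared by ``S3``, ``S8``, and so on."""

    @staticmethod
    def nonzero(item: bytes) -> bool:
        # Datatype-specific logic is defined once on the class, playing the
        # role of a slot that is currently reached through PyArray_ArrFuncs.
        return any(item)


s3 = BytesDType(itemsize=3)
s8 = BytesDType(itemsize=8)

print(isinstance(s3, DType))     # True
print(type(s3) is DType)         # False
print(type(s3) is type(s8))      # True
print(s3.itemsize, s8.itemsize)  # 3 8
```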
Further, the public API will be designed in a way that is extensible in the future:
- All new C-API functions provided to the user will hide implementation details as much as possible. The public API should be an identical, but limited, version of the C-API used for the internal NumPy datatypes.

The changes to the datatype system in Phase II must include a large refactor of the UFunc machinery, which will be further defined in NEP 43:

- To enable all of the desired functionality for new user-defined datatypes, the UFunc machinery will be changed to replace the current dispatching and type resolution system. The old system should be *mostly* supported as a legacy version for some time.
Additionally, as a general design principle, the addition of new user-defined datatypes will *not* change the behaviour of programs. For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or ``b`` know that ``c`` exists.
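One way to picture this principle is a lookup in which only the two input DTypes are ever consulted. The sketch below is a pure-Python illustration under that assumption; the ``common_with`` hook and all class names are hypothetical and do not describe the API that NEP 42 will define:

```python
class DType:
    @classmethod
    def common_with(cls, other):
        # Hypothetical hook: a DType only answers for DTypes it knows about.
        return NotImplemented


class Float64(DType):
    @classmethod
    def common_with(cls, other):
        return Float64 if other is Float64 else NotImplemented


class Int64(DType):
    @classmethod
    def common_with(cls, other):
        if other is Int64:
            return Int64
        if other is Float64:
            return Float64
        return NotImplemented


def common_dtype(a, b):
    # Ask each input in turn; no global table of "all DTypes" is consulted.
    for first, second in ((a, b), (b, a)):
        result = first.common_with(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common DType for {a.__name__} and {b.__name__}")


before = common_dtype(Int64, Float64)


class UserFloat128(DType):
    """A DType added later by some third-party package."""

    @classmethod
    def common_with(cls, other):
        if other in (UserFloat128, Float64, Int64):
            return UserFloat128
        return NotImplemented


after = common_dtype(Int64, Float64)
print(before is after is Float64)  # True
```

Because neither ``Int64`` nor ``Float64`` knows about ``UserFloat128``, defining it cannot change the result for existing inputs.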
User Impact
The current ecosystem has very few user-defined datatypes using NumPy, the two most prominent being: ``rational`` and ``quaternion``. These represent fairly simple datatypes which are not strongly impacted by the current limitations. However, we have identified a need for datatypes such as:
- bfloat16, used in deep learning
- categorical types
- physical units (such as meters)
- datatypes for tracing/automatic differentiation
- high, fixed precision math
- specialized integer types such as int2, int24
- new, better datetime representations
- extending e.g. integer dtypes to have a sentinel NA value
- geometrical objects [pygeos]_
Some of these are partially solved; for example unit capability is provided in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray` subclasses. Most of these datatypes, however, simply cannot be reasonably defined right now. An advantage of having such datatypes in NumPy is that they should integrate seamlessly with other array or array-like packages such as Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
The long term user impact of implementing this NEP will be to allow both the growth of the whole ecosystem by having such new datatypes, as well as consolidating implementation of such datatypes within NumPy to achieve better interoperability.
Examples
^^^^^^^^

The following examples represent future user-defined datatypes we wish to enable. These datatypes are not part of the NEP and choices (e.g. choice of casting rules) are possibilities we wish to enable and do not represent recommendations.

Simple Numerical Types
""""""""""""""""""""""

Mainly used where memory is a consideration, lower-precision numeric types such as `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ are common in other computational frameworks. For these types the definitions of things such as ``np.common_type`` and ``np.can_cast`` are some of the most important interfaces. Once they support ``np.common_type``, it is (for the most part) possible to find the correct ufunc loop to call, since most ufuncs -- such as add -- effectively only require ``np.result_type``::

    >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
and `~numpy.result_type` is largely identical to `~numpy.common_type`.
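With existing NumPy datatypes this relation can already be observed. The array names and the chosen dtypes below are arbitrary examples:

```python
import numpy as np

arr1 = np.arange(3, dtype=np.int32)
arr2 = np.ones(3, dtype=np.float32)

expected = np.result_type(arr1, arr2)
print(expected)                              # float64
print(np.add(arr1, arr2).dtype == expected)  # True
```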
Fixed, high precision math
""""""""""""""""""""""""""

Allowing arbitrary precision or higher precision math is important in simulations. For instance ``mpmath`` defines a precision::

    >>> from mpmath import mp
    >>> print(mp.dps)  # the current (default) precision
    15
NumPy should be able to construct a native, memory-efficient array from a list of ``mpmath.mpf`` floating point objects::

    >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
    >>> print(arr_15_dps)  # Must find the correct precision from the objects:
    array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
We should also be able to specify the desired precision when creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find the DType class (the notation is not part of this NEP), which is then instantiated with the desired parameter. This could also be written as an ``MpfDType`` class::

    >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
    >>> print(arr_15_dps + arr_100_dps)
    array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
The ``mpf`` datatype can decide that the result of the operation should be the higher precision one of the two, so uses a precision of 100. Furthermore, we should be able to define casting, for example as in::

    >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
    True
    >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
    False  # loses precision
    >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
    True
Casting from float is probably always at least a ``same_kind`` cast, but in general, it is not safe::

    >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
    False

since a float64 has a higher precision than the ``mpf`` datatype with ``dps=4``.
Alternatively, we can say that::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
    np.dtype[mp.mpf](dps=10)

And possibly even::

    >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
    np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)`` safely.
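The casting and promotion rules described for the ``mpf`` datatype can be sketched in pure Python. ``MpfDType``, ``can_cast_safely`` and ``common_type`` below are hypothetical helpers that model only the ``dps`` parameter; they are one possible choice of rules, not a recommendation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical parametric datatype carrying a decimal precision."""

    dps: int


def can_cast_safely(from_dtype: MpfDType, to_dtype: MpfDType) -> bool:
    # A cast is "safe" when the target keeps at least as many digits.
    return to_dtype.dps >= from_dtype.dps


def common_type(a: MpfDType, b: MpfDType) -> MpfDType:
    # Promotion picks the higher precision of the two inputs.
    return a if a.dps >= b.dps else b


dps15, dps100 = MpfDType(15), MpfDType(100)
print(can_cast_safely(dps15, dps100))  # True
print(can_cast_safely(dps100, dps15))  # False
print(common_type(MpfDType(5), MpfDType(10)))  # MpfDType(dps=10)
```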
Categoricals
""""""""""""

Categoricals are interesting in that they can have fixed, predefined values, or can be dynamic with the ability to modify categories when necessary. Fixed categories (defined ahead of time) are the most straightforward categorical definition. Categoricals are *hard*, since there are many strategies to implement them, suggesting NumPy should only provide the scaffolding for user-defined categorical types. For instance::

    >>> cat = Categorical(["eggs", "spam", "toast"])
    >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
could store the array very efficiently, since it knows that there are only 3 categories. Since a categorical in this sense knows almost nothing about the data stored in it, few operations make sense, although equality does::

    >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
    >>> breakfast == breakfast2
    array([True, False, True, False])
The categorical datatype could work like a dictionary: no two item names can be equal (checked on dtype creation), so that the equality operation above can be performed very efficiently. If the values define an order, the category labels (internally integers) could be ordered the same way to allow efficient sorting and comparison.
Whether or not casting is defined from one categorical with fewer values to one with strictly more values defined is something that the Categorical datatype would need to decide. Both options should be available.
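The dictionary-like behaviour can be illustrated with a small pure-Python sketch. The ``Categorical`` class here is a hypothetical stand-in that stores plain lists of integer codes rather than a NumPy array, and represents only one of many possible strategies:

```python
class Categorical:
    """Hypothetical fixed-category datatype backed by integer codes."""

    def __init__(self, categories):
        # Item names must be unique, which is checked on creation.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        self._codes = {name: i for i, name in enumerate(self.categories)}

    def encode(self, values):
        # Store each value as its small integer label.
        return [self._codes[value] for value in values]

    def decode(self, codes):
        return [self.categories[code] for code in codes]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])

print(breakfast)  # [0, 1, 0, 2]
# Equality only needs to compare the integer labels.
print([a == b for a, b in zip(breakfast, breakfast2)])
# [True, False, True, False]
```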
Unit on the Datatype
""""""""""""""""""""

There are different ways to define Units, depending on how the internal machinery would be organized. One way is to have a single Unit datatype for every existing numerical type. This will be written as ``Unit[float64]``; the unit itself is part of the DType instance, so that ``Unit[float64]("m")`` is a ``float64`` with meters attached::

    >>> from astropy import units
    >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
    >>> print(meters)
    array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
Note that units are a bit tricky. It is debatable whether::

    >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))

should be valid syntax (coercing the float scalars without a unit to meters). Once the array is created, math will work without any issue::

    >>> meters / (2 * units.s)
    array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
Casting is not valid from one unit to the other, but can be valid between different scales of the same dimensionality (although this may be "unsafe")::

    >>> meters.astype(Unit[float64]("s"))
    TypeError: Cannot cast meters to seconds.
    >>> meters.astype(Unit[float64]("km"))
    >>> # Convert to centimeter-gram-second (cgs) units:
    >>> meters.astype(meters.dtype.to_cgs())

The above notation is somewhat clumsy. Functions could be used instead to convert between units. There may be ways to make these more convenient, but those must be left for future discussions::

    >>> units.convert(meters, "km")
    >>> units.to_cgs(meters)
There are some open questions. For example, whether additional methods on the array object could exist to simplify some of the notions, and how these would percolate from the datatype to the ``ndarray``.
The interaction with other scalars would likely be defined through::

    >>> np.common_type(np.float64, Unit)
    Unit[np.float64](dimensionless)
Ufunc output datatype determination can be more involved than for simple numerical dtypes since there is no "universal" output type::

    >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)

In fact ``np.result_type(meters, seconds)`` must error without context of the operation being done. This example highlights how the specific ufunc loop (loop with known, specific DTypes as inputs) has to be able to make certain decisions before the actual calculation can start.
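The dependence of the output datatype on the operation can be made concrete with a pure-Python sketch. ``UnitDType`` and the helper functions are hypothetical; they track only base-unit exponents and ignore the numerical type and scale:

```python
class UnitDType:
    """Hypothetical unit datatype storing base-unit exponents, e.g. m=1, s=-1."""

    def __init__(self, **exponents):
        # Drop zero exponents so that m/m compares equal to "dimensionless".
        self.exponents = {name: p for name, p in exponents.items() if p != 0}

    def __eq__(self, other):
        return isinstance(other, UnitDType) and self.exponents == other.exponents

    def __repr__(self):
        return f"UnitDType({self.exponents})"


def _combine(a, b, sign):
    names = set(a.exponents) | set(b.exponents)
    return UnitDType(**{
        n: a.exponents.get(n, 0) + sign * b.exponents.get(n, 0) for n in names
    })


def multiply_dtype(a, b):
    # Output unit of a * b: exponents add.
    return _combine(a, b, +1)


def divide_dtype(a, b):
    # Output unit of a / b: exponents subtract.
    return _combine(a, b, -1)


def result_type(a, b):
    # Without knowing the operation, only identical units have a common type.
    if a == b:
        return a
    raise TypeError("no common unit without knowing the operation")


meters = UnitDType(m=1)
seconds = UnitDType(s=1)
speed = divide_dtype(meters, seconds)
print(speed == UnitDType(m=1, s=-1))  # True
```

Here ``multiply_dtype`` and ``divide_dtype`` give different answers for the same inputs, while the operation-independent ``result_type`` has to raise.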
Implementation
Plan to Approach the Full Refactor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To address these issues in NumPy and enable new datatypes, multiple development stages are required:

- Phase I: Restructure and extend the datatype infrastructure (This NEP)

  - Organize Datatypes like normal Python classes [`PR 15508`]_

- Phase II: Incrementally define or rework API

  - Create a new and easily extensible API for defining new datatypes and related functionality. (NEP 42)

  - Incrementally define all necessary functionality through the new API (NEP 42):

    * Defining operations such as ``np.common_type``.
    * Allowing to define casting between datatypes.
    * Add functionality necessary to create a numpy array from Python scalars (i.e. ``np.array(...)``).
    * …

  - Restructure how universal functions work (NEP 43), in order to:

    * make it possible to allow a `~numpy.ufunc` such as ``np.add`` to be extended by user-defined datatypes such as Units.
    * allow efficient lookup for the correct implementation for user-defined datatypes.
    * enable reuse of existing code. Units should be able to use the normal math loops and add additional logic to determine output type.

- Phase III: Growth of NumPy and Scientific Python Ecosystem capabilities:

  - Cleanup of legacy behaviour where it is considered buggy or undesirable.
  - Provide a path to define new datatypes from Python.
  - Assist the community in creating types such as Units or Categoricals
  - Allow strings to be used in functions such as ``np.equal`` or ``np.add``.
  - Remove legacy code paths within NumPy to improve long term maintainability
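The ufunc goals listed for Phase II (extension by user DTypes, efficient lookup, and reuse of existing loops) can be pictured with a small registry keyed by DType classes. This is a pure-Python sketch under our own assumptions; ``LOOPS``, ``register`` and ``add`` are hypothetical names and do not anticipate the design of NEP 43:

```python
class Float64DType:
    pass


class UnitDType:
    def __init__(self, unit):
        self.unit = unit


def float_add_loop(xs, ys):
    # The "normal math loop": plain element-wise addition.
    return [x + y for x, y in zip(xs, ys)]


# Maps (DType class, DType class) -> (output dtype resolver, inner loop).
LOOPS = {}


def register(in1, in2, resolve, loop):
    LOOPS[(in1, in2)] = (resolve, loop)


def add(xs, x_dtype, ys, y_dtype):
    key = (type(x_dtype), type(y_dtype))
    if key not in LOOPS:
        raise TypeError("no loop registered for these DTypes")
    resolve, loop = LOOPS[key]
    # The output dtype is decided before the calculation starts.
    out_dtype = resolve(x_dtype, y_dtype)
    return loop(xs, ys), out_dtype


def resolve_units(a, b):
    if a.unit != b.unit:
        raise TypeError("cannot add values with different units")
    return UnitDType(a.unit)


register(Float64DType, Float64DType, lambda a, b: Float64DType(), float_add_loop)
# The unit datatype reuses the float loop and only adds its own dtype logic.
register(UnitDType, UnitDType, resolve_units, float_add_loop)

values, out = add([1.0, 2.0], UnitDType("m"), [3.0, 4.0], UnitDType("m"))
print(values, out.unit)  # [4.0, 6.0] m
```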
This document serves as a basis for phase I and provides the vision and motivation for the full project. Phase I does not introduce any new user-facing features, but is concerned with the necessary conceptual cleanup of the current datatype system. It provides a more "pythonic" datatype Python type object, with a clear class hierarchy.
The second phase is the incremental creation of all APIs necessary to define fully featured datatypes and reorganization of the NumPy datatype system. This phase will thus be primarily concerned with defining an, initially preliminary, stable public API.
Some of the benefits of a large refactor may only become evident after the full deprecation of the current legacy implementation (i.e. larger code removals). However, these steps are necessary for improvements to many parts of the core NumPy API, and are expected to make the implementation generally easier to understand.
The following figure illustrates the proposed design at a high level, and roughly delineates the components of the overall design. Note that this NEP only covers Phase I (shaded area); the rest encompasses Phase II, whose design choices are up for discussion. The figure nevertheless highlights that the DType datatype class is the central, necessary concept:
.. image:: _static/nep-0041-mindmap.svg
First steps directly related to this NEP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The required changes necessary to NumPy are large and touch many areas of the code base but many of these changes can be addressed incrementally.
To enable an incremental approach we will start by creating a C defined ``PyArray_DTypeMeta`` class with its instances being the ``DType`` classes, subclasses of ``np.dtype``. This is necessary to add the ability of storing custom slots on the DType in C. This ``DTypeMeta`` will be implemented first to then enable incremental restructuring of current code.
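In Python terms, the relationship between ``DTypeMeta``, the DType classes, and the dtype instances resembles an ordinary metaclass. The following sketch is a pure-Python analogy only: the real ``PyArray_DTypeMeta`` is defined in C, and the ``scalar_type`` attribute is a hypothetical example of a class-level slot:

```python
class DTypeMeta(type):
    """Metaclass whose instances are the DType classes."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # A class-level "slot": stored once per DType class, not per instance.
        cls.scalar_type = namespace.get("scalar_type")


class dtype(metaclass=DTypeMeta):
    """Stand-in for ``np.dtype``."""

    def __init__(self, itemsize):
        self.itemsize = itemsize


class Float64DType(dtype):
    scalar_type = float


f8 = Float64DType(itemsize=8)
print(type(Float64DType) is DTypeMeta)  # True
print(isinstance(f8, dtype))            # True
print(Float64DType.scalar_type)         # <class 'float'>
```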
The addition of ``DType`` will then enable addressing other changes incrementally, some of which may begin before settling the full internal API:

1. New machinery for array coercion, with the goal of enabling user DTypes with appropriate class methods.
2. The replacement or wrapping of the current casting machinery.
3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots into DType method slots.

At this point, no or only very limited new public API will be added and the internal API is considered to be in flux. Any new public API may be set up to give warnings and will have leading underscores to indicate that it is not finalized and can be changed without warning.
Backward compatibility
While the actual backward compatibility impact of implementing Phase I and II is not yet fully clear, we anticipate and accept the following changes:

**Python API**:

- ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``, while right now ``type(np.dtype("f8")) is np.dtype``. Code should use ``isinstance`` checks, and in very rare cases may have to be adapted to use it.
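The difference between the two kinds of check can be illustrated with existing NumPy. The ``isinstance`` form is valid regardless of whether ``type(f8)`` is ``np.dtype`` itself or a subclass of it, which is why it is the recommended spelling:

```python
import numpy as np

f8 = np.dtype("f8")

# Robust check: holds both before and after the change described above.
print(isinstance(f8, np.dtype))  # True

# Fragile check: an exact-type comparison assumes that every dtype instance
# has exactly the class ``np.dtype``, so its result depends on the NumPy version.
is_exact_type = type(f8) is np.dtype
```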
**C-API**:
- In old versions of NumPy ``PyArray_DescrCheck`` is a macro which uses ``type(dtype) is np.dtype``. When compiling against an old NumPy version, the macro may have to be replaced with the corresponding ``PyObject_IsInstance`` call. (If this is a problem, we could backport fixing the macro.)
- The UFunc machinery changes will break *limited* parts of the current implementation. Replacing e.g. the default ``TypeResolver`` is expected to remain supported for a time, although optimized masked inner loop iteration (which is not even used *within* NumPy) will no longer be supported.
- All functions currently defined on the dtypes, such as ``PyArray_Descr->f->nonzero``, will be defined and accessed differently. This means that in the long run low-level access code will have to be changed to use the new API. Such changes are expected to be necessary in very few projects.
**dtype implementors (C-API)**:
- The array which is currently provided to some functions (such as cast functions), will no longer be provided. For example ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f->copyswapn``, may instead receive a dummy array object with only some fields (mainly the dtype), being valid. At least in some code paths, a similar mechanism is already used.
- The ``scalarkind`` slot and registration of scalar casting will be removed/ignored without replacement. It currently allows partial value-based casting. The ``PyArray_ScalarKind`` function will continue to work for builtin types, but will not be used internally and be deprecated.
- Currently user dtypes are defined as instances of ``np.dtype``. The creation works by the user providing a prototype instance. NumPy will need to modify at least the type during registration. This has no effect for either ``rational`` or ``quaternion`` and mutation of the structure seems unlikely after registration.

Since there is a fairly large API surface concerning datatypes, further changes or the limitation of certain functions to currently existing datatypes are likely to occur. For example functions which use the type number as input should be replaced with functions taking DType classes instead. Although public, large parts of this C-API seem to be used rarely, possibly never, by downstream projects.
Detailed Description
This section details the design decisions covered by this NEP. The subsections correspond to the list of design choices presented in the Scope section.
Datatypes as Python Classes (1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current NumPy datatypes are not full scale python classes. They are instead (prototype) instances of a single ``np.dtype`` class. Changing this means that any special handling, e.g. for ``datetime`` can be moved to the Datetime DType class instead, away from monolithic general code (e.g. current ``PyArray_AdjustFlexibleDType``).
The main consequence of this change with respect to the API is that special methods move from the dtype instances to methods on the new DType class. This is the typical design pattern used in Python. Organizing these methods and information in a more Pythonic way provides a solid foundation for refining and extending the API in the future. The current API cannot be extended due to how it is exposed publicly. This means for example that the methods currently stored in ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined differently in the future and deprecated in the long run.
The most prominent visible side effect of this will be that ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore. Instead it will be a subclass of ``np.dtype`` meaning that ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true. This will also add the ability to use ``isinstance(dtype, np.dtype[float64])`` thus removing the need to use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this check.
With the design decision of DTypes as full-scale Python classes, the question of subclassing arises. Inheritance, however, appears problematic and a complexity best avoided (at least initially) for container datatypes. Further, subclasses may be more interesting for interoperability, for example with GPU backends (CuPy) storing additional methods related to the GPU, rather than as a mechanism to define new datatypes. A class hierarchy does provide value; this may be achieved by allowing the creation of *abstract* datatypes. An example for an abstract datatype would be the datatype equivalent of ``np.floating``, representing any floating point number. These can serve the same purpose as Python's abstract base classes.
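The analogy with Python's abstract base classes can be sketched as follows. ``DType``, ``Floating`` and ``Float64`` are hypothetical pure-Python stand-ins; how abstract DTypes will actually be declared is not decided in this NEP:

```python
from abc import ABC, abstractmethod


class DType(ABC):
    @property
    @abstractmethod
    def itemsize(self):
        """Concrete DTypes must say how many bytes one element occupies."""


class Floating(DType):
    """Abstract: represents any floating point number, cannot be instantiated."""


class Float64(Floating):
    # Providing the attribute makes this class concrete.
    itemsize = 8


f8 = Float64()
print(isinstance(f8, Floating))  # True
print(f8.itemsize)               # 8

try:
    Floating()
except TypeError:
    print("Floating is abstract")  # Floating is abstract
```

A check such as ``isinstance(f8, Floating)`` then plays the role that ``dtype.kind`` comparisons play today.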
Scalars should not be instances of the datatypes (2) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For simple datatypes such as ``float64`` (see also below), it seems tempting that the instance of a ``np.dtype("float64")`` can be the scalar. This idea may be even more appealing due to the fact that scalars, rather than datatypes, currently define a useful type hierarchy.
However, we have specifically decided against this for a number of reasons. First, the new datatypes described herein would be instances of DType classes. Making these instances themselves classes, while possible, adds additional complexity that users need to understand. It would also mean that scalars must have storage information (such as byteorder) which is generally unnecessary and currently is not used. Second, while the simple NumPy scalars such as ``float64`` may be such instances, it should be possible to create datatypes for Python objects without enforcing NumPy as a dependency. However, Python objects that do not depend on NumPy cannot be instances of a NumPy DType. Third, there is a mismatch between the methods and attributes which are useful for scalars and datatypes. For instance ``to_float()`` makes sense for a scalar but not for a datatype and ``newbyteorder`` is not useful on a scalar (or has a different meaning).
Overall, it seems that rather than reducing complexity (i.e. by merging the two distinct type hierarchies), making scalars instances of DTypes would increase the complexity of both the design and implementation.
A possible future path may be to instead simplify the current NumPy scalars to be much simpler objects which largely derive their behaviour from the datatypes.
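One way to picture such a simplified scalar is sketched below: the scalar holds only a value and a reference to its datatype, and storage-related behaviour is delegated to the datatype. This is a speculative toy model under the assumption that delegation is what "derive their behaviour" would mean; ``Float64DType``, ``Scalar``, ``pack`` and ``unpack`` are hypothetical names.

```python
import struct


class Float64DType:
    """Toy dtype: knows how values are stored and converted."""

    def __init__(self, byteorder="<"):
        self.byteorder = byteorder  # storage detail, lives on the dtype

    def pack(self, value):
        return struct.pack(self.byteorder + "d", value)

    def unpack(self, raw):
        return struct.unpack(self.byteorder + "d", raw)[0]


class Scalar:
    """Toy scalar: a value plus a reference to its dtype, nothing more."""

    def __init__(self, value, dtype):
        self.value = value
        self.dtype = dtype

    def to_bytes(self):
        # Behaviour is delegated to the dtype instead of being
        # re-implemented on the scalar type.
        return self.dtype.pack(self.value)


little = Scalar(1.5, Float64DType("<"))
big = Scalar(1.5, Float64DType(">"))
```

The scalar itself carries no byte-order information; only the datatype instance does.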
C-API for creating new Datatypes (3) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current C-API with which users can create new datatypes is limited in scope, and requires use of "private" structures. This means the API is not extensible: no new members can be added to the structure without losing binary compatibility. This has already limited the inclusion of new sorting methods into NumPy [new_sort]_.
The new version shall thus replace the current ``PyArray_ArrFuncs`` structure used to define new datatypes. Datatypes that currently exist and are defined using these slots will be supported during a deprecation period.
The most likely solution to hide the implementation from the user, and thus make it extensible in the future, is to model the API after Python's stable API [PEP-384]_:
.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct{
      PyTypeObject *typeobj;  /* type of python scalar */
      ...;
      PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);

The C-side slots should be designed to mirror Python side methods such as ``dtype.__dtype_method__``, although the exposure to Python is a later step in the implementation to reduce the complexity of the initial implementation.
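The reason a slot table stays extensible can be seen in a small Python analogue. Callers pass ``(slot_id, function)`` pairs and never touch the fields of the object being initialized, so new slot ids can be added later without changing any layout that users depend on. Everything here (``SLOT_GETITEM``, ``SLOT_SETITEM``, ``ToyDTypeMeta``, ``init_dtype_from_spec``, ``call_slot``) is hypothetical and only mirrors the shape of the C sketch above.

```python
# Hypothetical slot identifiers; the real names would be defined by NumPy.
SLOT_GETITEM = 1
SLOT_SETITEM = 2


class ToyDTypeMeta:
    """Opaque container: callers never touch its fields directly."""

    def __init__(self):
        self._methods = {}


def init_dtype_from_spec(dtype_meta, slots):
    """Fill ``dtype_meta`` from (slot_id, function) pairs."""
    known = {SLOT_GETITEM, SLOT_SETITEM}
    for slot_id, func in slots:
        if slot_id not in known:
            raise ValueError(f"unknown slot id: {slot_id}")
        dtype_meta._methods[slot_id] = func
    return dtype_meta


def call_slot(dtype_meta, slot_id, *args):
    """Look up a slot by id and call it."""
    return dtype_meta._methods[slot_id](*args)


# A user-defined 4-byte little-endian unsigned integer, described by slots.
spec = [
    (SLOT_GETITEM, lambda raw: int.from_bytes(raw, "little")),
    (SLOT_SETITEM, lambda value: value.to_bytes(4, "little")),
]
user_dtype = init_dtype_from_spec(ToyDTypeMeta(), spec)
raw = call_slot(user_dtype, SLOT_SETITEM, 258)
value = call_slot(user_dtype, SLOT_GETITEM, raw)
```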
C-API Changes to the UFunc Machinery (4) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proposed changes to the UFunc machinery will be part of NEP 43. However, the following changes will be necessary (see NEP 40 for a detailed description of the current implementation and its issues):
- The current UFunc type resolution must be adapted to allow better control
  for user-defined dtypes as well as resolve current inconsistencies.
- The inner-loop used in UFuncs must be expanded to include a return value.
  Further, error reporting must be improved, and passing in dtype-specific
  information enabled. This requires the modification of the inner-loop
  function signature and addition of new hooks called before and after the
  inner-loop is used.
An important goal for any changes to the universal functions will be to allow the reuse of existing loops. It should be easy for a new units datatype to fall back to existing math functions after handling the unit related computations.
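The units example can be sketched in plain Python: the unit bookkeeping is done by datatype-specific code, and the numeric work is handed to an existing loop unchanged. This is a toy model, not the proposed UFunc API; ``UnitArray``, ``multiply_units`` and ``float_multiply_loop`` are names invented for the sketch.

```python
import operator


def float_multiply_loop(a_values, b_values):
    """Stand-in for an existing float64 inner loop."""
    return [operator.mul(a, b) for a, b in zip(a_values, b_values)]


class UnitArray:
    """Toy array of floats that carries a unit as a dict of exponents."""

    def __init__(self, values, unit):
        self.values = list(values)
        self.unit = dict(unit)


def multiply_units(unit_a, unit_b):
    # Unit-specific step: add exponents, dropping units that cancel.
    result = dict(unit_a)
    for name, exponent in unit_b.items():
        total = result.get(name, 0) + exponent
        if total:
            result[name] = total
        else:
            result.pop(name, None)
    return result


def multiply(a, b):
    out_unit = multiply_units(a.unit, b.unit)  # handled by the unit datatype
    out_values = float_multiply_loop(a.values, b.values)  # reused loop
    return UnitArray(out_values, out_unit)


speed = UnitArray([2.0, 3.0], {"m": 1, "s": -1})
time = UnitArray([4.0, 5.0], {"s": 1})
distance = multiply(speed, time)
```

Only ``multiply_units`` is new code; the numeric loop is reused as is.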
Discussion
See NEP 40 for a list of previous meetings and discussions.
References
.. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#exte...
.. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
.. [pygeos] https://github.com/caspervdw/pygeos
.. [new_sort] https://github.com/numpy/numpy/pull/12945
.. [PEP-384] https://www.python.org/dev/peps/pep-0384/
.. [PR 15508] https://github.com/numpy/numpy/pull/15508
Copyright
This document has been placed in the public domain.
Acknowledgments
The effort to create new datatypes for NumPy has been discussed for several years in many different contexts and settings, making it impossible to list everyone involved. We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric Wieser for repeated in-depth discussion about datatype design. We are very grateful for the community input in reviewing and revising this NEP and would like to thank especially Ross Barnowski and Ralf Gommers.
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Mon, 2020-03-23 at 18:23 +0100, Francesc Alted wrote: <snip>
If we were designing a new programming language around array computing principles, I do think that would be the approach I would want to take/consider. But I simply lack the vision of how marrying the idea with the scalar language Python would work out well...
I have had a glance at what you are after, and it seems challenging indeed. IMO, trying to cope with everybody's needs regarding data types is extremely costly (or more simply, not even possible). I think that a better approach would be to decouple the storage part of a container from its data type system. In the storage there should go things needed to cope with the data retrieval, like the itemsize, the shape, or even other sub-shapes for chunked datasets. Then, in the data type layer, one should be able to add meaning to the raw data: is that an integer? speed? temperature? a compound type?
I am struggling a bit to fully understand the lessons to learn.
There seems to be some overlap of storage and DTypes? That is mainly `itemsize` and the more tricky `is/has_object`, which is about how the data is stored but depends on which data is stored. In my current view these are part of the `dtype` instance, e.g. the class `np.dtype[np.string]` (a DTypeMeta instance), will have instances: `np.dtype[np.string](length=5, byteorder="=")` (which is identical to `np.dtype("U5")`).
Or is it that `np.ndarray` would actually use an `np.naivearray` internally, which is told the itemsize at construction time? In principle, the DType class could also be much more basic, and NumPy could subclass it (or something similar) to tag on the things it needs to efficiently use the DTypes (outside the computational engine/UFuncs, which cover a lot, but unfortunately I do not think everything).
- Sebastian
Indeed the data storage layer should be able to provide a way to store the data type representation so that a container can be serialized and deserialized correctly. But the important thing here is that this decoupling between storage and types allows for different data type systems, so that anyone can come up with a specific type system depending on their needs. One can envision here even a basic data type system (e.g. a version of what's now supported in NumPy) that can be extended with other layers, depending on the needs, so that every community can interchange data at a basic level at least.
As an example, this is the basic spirit behind the under-construction Caterva array container (https://github.com/Blosc/Caterva). Blosc2 ( https://github.com/Blosc/C-Blosc2) will be providing the low-level storage layer, with no information about dimensionality. Caterva will be building the multidimensional layer, but with no information about the types at all. On top of this scaffold, third-party layers will be free to build their own data dtypes, specific for every domain (the concept is illustrated in slide 18 of this presentation: https://blosc.org/docs/Caterva-HDF5-Workshop.pdf). Nothing prevents adding more layers, or even a two-layer (and preferably no more than two-level) data type system: one for simple data types (e.g. NumPy ones) and one meant to be more domain-specific.
Right now, I think that the limitation that is keeping the NumPy community thinking in terms of blending storage and types in the same layer is that NumPy is providing a computational engine too, and for doing computations, one needs to provide both storage (including dimensionality info) and type information. By using the multi-layer approach, there should be a computational layer that is laid out on top of the storage and the type layers, and hence, specific for leveraging them. Sure, that is a big departure from what we are used to, but as long as one can keep the architecture of the different layers simple, one could see interesting results before long.
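The layering described above could be sketched roughly as follows: a storage layer that knows only bytes, itemsize and shape, a basic type layer that gives the bytes a meaning, and a domain-specific layer on top of that. This is a toy illustration using only the standard library; ``RawStorage``, ``Int32Type`` and ``CelsiusType`` are hypothetical names and do not reflect how Caterva, Blosc2 or NumPy are implemented.

```python
import struct


class RawStorage:
    """Storage layer: bytes plus itemsize and shape, no notion of type."""

    def __init__(self, data, itemsize, shape):
        count = 1
        for dim in shape:
            count *= dim
        if len(data) != count * itemsize:
            raise ValueError("buffer size does not match itemsize * shape")
        self.data = bytes(data)
        self.itemsize = itemsize
        self.shape = tuple(shape)

    def item_bytes(self, flat_index):
        start = flat_index * self.itemsize
        return self.data[start:start + self.itemsize]


class Int32Type:
    """Basic type layer: gives meaning to 4 raw bytes."""

    itemsize = 4

    def decode(self, raw):
        return struct.unpack("<i", raw)[0]


class CelsiusType(Int32Type):
    """A domain-specific layer built on top of the basic one."""

    def decode(self, raw):
        return f"{super().decode(raw)} degC"


storage = RawStorage(struct.pack("<3i", 20, 21, -5), itemsize=4, shape=(3,))
as_int = [Int32Type().decode(storage.item_bytes(i)) for i in range(3)]
as_temperature = [CelsiusType().decode(storage.item_bytes(i)) for i in range(3)]
```

The same stored bytes are read through two different type layers without the storage layer knowing about either.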
Just my 2 cents, Francesc
<snip>
On Mon, Mar 23, 2020 at 9:49 PM Sebastian Berg sebastian@sipsolutions.net wrote:
<snip>
What I am trying to say is that NumPy should be rather agnostic about providing data types beyond the relatively simple set that it already supports. I am suggesting that focusing on providing a way to allow the storage (not only in-memory, but also persisted arrays via .npy/.npz files) of user-defined data types (or any other kind of metadata) and let 3rd party libraries use this machinery to serialize/deserialize them might be a better use of resources.
I am envisioning making life easier for libraries like e.g. xarray, which already extends NumPy in a number of ways, and that can make use of computational kernels different than NumPy itself (dask, probably numba too) in order to implement functionality not present in NumPy. Allowing an easy way to serialize library-defined data types would open the door to use NumPy itself as a storage layer for persistency too, bringing an important complement to NetCDF or zarr formats (remember that every format comes with its own pros and cons).
But xarray is just an example; why not think of other kinds of libraries that would provide their own types, leveraging NumPy for storage and e.g. numba for building a library of efficient functions, specific for the new types? If done properly, these datasets can still be shared efficiently with other libraries, as long as the basic data type system existing in NumPy is used to access them.
Cheers, Francesc
<snip>
On 24/3/20 11:48 am, Francesc Alted wrote:
What I am trying to say is that NumPy should be rather agnostic about providing data types beyond the relatively simple set that it already supports. I am suggesting that focusing on providing a way to allow the storage (not only in-memory, but also persisted arrays via .npy/.npz files) of user-defined data types (or any other kind of metadata) and let 3rd party libraries use this machinery to serialize/deserialize them might be a better use of resources.
... Cheers, Francesc
I agree that the goal is to enable user-defined data types, and even make the creation of them from python possible (with some caveats about performance). But I think this should be done in steps, and as the subject line says this is the first step. There are many scary details to work out around the problems of promotion and casting, what to do when the output might overflow, how to mark missing values and more. The question at hand is, as I understand it, one of finding the right way to create a data type object that will enable exactly what you propose. I think this is the correct path, as most large refactor-in-one-step efforts I have seen leave both the old code and the new code in an unusable state for years until the bugs are worked out.
As for serialization protocols: I think that is a separate issue. We already have the npy/npz protocol, PEP3118 buffer protocol, and the pickle 5 buffering protocol. Each of them handles user-defined data types in different ways, with differing amounts of success.
Matti
On Tue, Mar 24, 2020 at 12:12 PM Matti Picus matti.picus@gmail.com wrote:
I agree that the goal is to enable user-defined data types, and even make the creation of them from python possible (with some caveats about performance). But I think this should be done in steps, and as the subject line says this is the first step. There are many scary details to work out around the problems of promotion and casting, what to do when the output might overflow, how to mark missing values and more. The question at hand is, as I understand it, one of finding the right way to create a data type object that will enable exactly what you propose. I think this is the correct path, as most large refactor-in-one-step efforts I have seen leave both the old code and the new code in an unusable state for years until the bugs are worked out.
Thanks Matti for clarifying the goals of the NEP; having the phrase "New Datatype System" in the title sounded scary to my ears indeed, and I share your concerns about new code remaining in a 'beta' stage for a long time. Before shutting up, I'll just reiterate that providing pretty shallow machinery for allowing the integration with user-defined data types should avoid big headaches: the simpler, the better. But this is of course up to the maintainers.
As for serialization protocols: I think that is a separate issue. We already have the npy/npz protocol, PEP3118 buffer protocol, and the pickle 5 buffering protocol. Each of them handles user-defined data types in different ways, with differing amounts of success.
Yup, I forgot the buffer protocol and pickle 5. Thanks for the reminder.
Cheers,
Hi all,
I propose officially accepting NEP 41:
"First step towards a new Datatype System"
If you have any concerns please let me know or discuss here within a week. If there are no concerns voiced the NEP may be accepted. I realize that there may be some who need time to think about this individually and will of course wait, but at this time I hope that no larger discussion on the mailing list will be necessary.
Again, the main immediate effect/design choice is that there will be classes for each NumPy dtype:
    import numpy as np

    float64 = np.dtype("float64")  # Native byteorder float64
    Float64DType = type(float64)  # np.dtype[float64]
    issubclass(Float64DType, np.dtype)  # True
    isinstance(float64, np.dtype)  # True (as before)
And in the above `float64.newbyteorder()` will also be an instance of the same `Float64DType` class. As such the class `Float64DType` in the above represents what is currently represented by the type number: `float64.num`
This does admittedly mean that `Float64DType` effectively is a class with only a singleton instance in most cases, since non-native byte order or metadata are rarely used. Multiple instances are mainly necessary for datatypes such as current strings (with varying length) or datetimes (with a unit). There are probably alternatives and the boundaries between instances and classes can be drawn at different places (even within this framework), but I believe that it is the practical and intuitive approach to draw them at the current type numbers.
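The split between classes and instances can be pictured with a small pure-Python model. It is only a sketch: the class names and constructor parameters below are assumptions for illustration and do not use NumPy.

```python
class DType:
    """Toy base class; equality compares class and parameters."""

    def __init__(self, **params):
        self.params = params

    def __eq__(self, other):
        return type(self) is type(other) and self.params == other.params

    def __hash__(self):
        return hash((type(self), tuple(sorted(self.params.items()))))


class Float64DType(DType):
    def __init__(self, byteorder="="):
        super().__init__(byteorder=byteorder)


class StringDType(DType):
    def __init__(self, length, byteorder="="):
        super().__init__(length=length, byteorder=byteorder)


class DatetimeDType(DType):
    def __init__(self, unit):
        super().__init__(unit=unit)


# Usually only one distinct Float64DType instance is ever needed ...
native_a, native_b = Float64DType(), Float64DType()
# ... while strings and datetimes need one instance per parameter value.
u5, u10 = StringDType(length=5), StringDType(length=10)
seconds, days = DatetimeDType("s"), DatetimeDType("D")
```

In this model the class corresponds to the current type number, and the instance parameters correspond to length, unit or byte order.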
Best,
Sebastian
On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
<snip>