[Numpy-discussion] Proposal: NEP 41 -- First step towards a new Datatype System
Sebastian Berg
sebastian at sipsolutions.net
Tue Mar 17 16:02:43 EDT 2020
Hi all,
in the spirit of trying to keep this moving, can I assume that the main
reason for little discussion is that the actual changes proposed are
not very far reaching as of now? Or is the reason that this is a
fairly complex topic that you need more time to think about?
If it is the latter, is there some way I can help with it? I tried to
minimize how much is part of this initial NEP.
If there is not much need for discussion, I would like to officially
accept the NEP very soon, sending out an official one week notice in
the next few days.
To summarize one more time, the main point is that:
type(np.dtype(np.float64))
will be `np.dtype[float64]`, a subclass of dtype, so that:
issubclass(np.dtype[float64], np.dtype)
is true. This means that we will have one class for every current type
number: `dtype.num`. These subclasses will be implemented via a
C-written (extension) MetaClass; all details of this class are supposed
to remain experimental and in flux at this time.
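
To make the class layout a bit more tangible, here is a rough pure-Python model of it. This is only a sketch and not how NumPy implements it: the names ``DTypeMeta``, ``dtype`` and ``scalar`` below are stand-ins invented for the illustration, and the real metaclass is written in C.

```python
# Pure-Python model of the proposed layout; NOT NumPy's implementation.
_dtype_classes = {}  # scalar type -> DType class (one class per scalar type)


class DTypeMeta(type):
    """Hypothetical stand-in for the C-written metaclass."""

    def __getitem__(cls, scalar_type):
        # Mimics the `np.dtype[float64]` spelling: look up or create the
        # subclass belonging to one scalar type.
        if scalar_type not in _dtype_classes:
            name = f"dtype[{scalar_type.__name__}]"
            _dtype_classes[scalar_type] = DTypeMeta(
                name, (dtype,), {"scalar": scalar_type})
        return _dtype_classes[scalar_type]


class dtype(metaclass=DTypeMeta):
    scalar = None

    def __new__(cls, scalar_type=None):
        # Calling the base class hands back an instance of the subclass.
        if cls is dtype:
            cls = dtype[scalar_type]
        return object.__new__(cls)


f = dtype(float)
print(type(f).__name__)                 # name of the float-specific class
print(issubclass(dtype[float], dtype))  # the subclass relation from above
```

Existing ``isinstance(f, dtype)`` checks keep working in this model, while ``type(f) is dtype`` no longer holds.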
Cheers
Sebastian
On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
> Hi all,
>
> I am pleased to propose NEP 41: First step towards a new Datatype
> System https://numpy.org/neps/nep-0041-improved-dtype-support.html
>
> This NEP motivates the larger restructure of the datatype machinery in
> NumPy and defines a few fundamental design aspects. The long term user
> impact will be allowing easier and more feature-rich user-defined
> datatypes.
>
> As this is a large restructure, the NEP represents only the first
> steps
> with some additional information in further NEPs being drafted [1]
> (this may be helpful to look at depending on the level of detail you
> are interested in).
> The NEP itself does not propose to add significant new public API.
> Instead it proposes to move forward with an incremental internal
> refactor and lays the foundation for this process.
>
> The main user facing change at this time is that datatypes will become
> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
> class).
> For most users, the main impact should be many new datatypes in the
> long run (see the user impact section). However, for those interested
> in API design within NumPy or with respect to implementing new
> datatypes, this and the following NEPs are important decisions in the
> future roadmap for NumPy.
>
> The current full text is reproduced below, although the above link is
> probably a better way to read it.
>
> Cheers
>
> Sebastian
>
>
> [1] NEP 40 gives some background information about the current system
> and issues with it:
> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> and NEP 42 is a first draft of how the new API may look:
>
> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> (links to current rendered versions, check
> https://github.com/numpy/numpy/pull/15505 and
> https://github.com/numpy/numpy/pull/15507 for updates)
>
>
> ----------------------------------------------------------------------
>
>
> =================================================
> NEP 41 — First step towards a new Datatype System
> =================================================
>
> :title: Improved Datatype Support
> :Author: Sebastian Berg
> :Author: Stéfan van der Walt
> :Author: Matti Picus
> :Status: Draft
> :Type: Standard Track
> :Created: 2020-02-03
>
>
> .. note::
>
> This NEP is part of a series of NEPs. NEP 40 first provides information
> about the previous dtype implementation and the issues with it.
> NEP 41 (this document) then provides an overview and generic
> design
> choices for the refactor.
> Further NEPs 42 and 43 go into the technical details of the
> datatype
> and universal function related internal and external API changes.
> In some cases it may be necessary to consult the other NEPs for a
> full
> picture of the desired changes and why these changes are
> necessary.
>
>
> Abstract
> --------
>
> `Datatypes <data-type-objects-dtype>` in NumPy describe how to
> interpret each
> element in arrays. NumPy provides ``int``, ``float``, and ``complex``
> numerical
> types, as well as string, datetime, and structured datatype
> capabilities.
> The growing Python community, however, has need for more diverse
> datatypes.
> Examples are datatypes with unit information attached (such as
> meters) or
> categorical datatypes (fixed set of possible values).
> However, the current NumPy datatype API is too limited to allow the
> creation
> of these.
>
> This NEP is the first step to enable such growth; it will lead to
> a simpler development path for new datatypes.
> In the long run the new datatype system will also support the
> creation
> of datatypes directly from Python rather than C.
> Refactoring the datatype API will improve maintainability and
> facilitate
> development of both user-defined external datatypes,
> as well as new features for existing datatypes internal to NumPy.
>
>
> Motivation and Scope
> --------------------
>
> .. seealso::
>
> The user impact section includes examples of what kind of new
> datatypes
> will be enabled by the proposed changes in the long run.
> It may thus help to read these sections out of order.
>
> Motivation
> ^^^^^^^^^^
>
> One of the main issues with the current API is the definition of
> typical
> functions such as addition and multiplication for parametric
> datatypes
> (see also NEP 40) which require additional steps to determine the
> output type.
> For example when adding two strings of length 4, the result is a
> string
> of length 8, which is different from the input.
> Similarly, a datatype which embeds a physical unit must calculate the
> new unit
> information: dividing a distance by a time results in a speed.
> A related difficulty is that the :ref:`current casting rules
> <ufuncs.casting>`
> -- the conversion between different datatypes --
> cannot describe casting for such parametric datatypes implemented
> outside of NumPy.
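
The string and unit examples above can be sketched in plain Python as a parametric datatype that computes its own output type. All names here (``StringDType``, ``UnitDType``, ``resolve_add``, ``resolve_divide``) are invented for the illustration and are not NumPy API; the sketch only assumes that the parameter lives on the dtype instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDType:
    """Hypothetical parametric string dtype; `length` is the parameter."""
    length: int

    def resolve_add(self, other):
        # Concatenation: the output length is the sum of the input lengths,
        # so the result dtype differs from both inputs.
        return StringDType(self.length + other.length)


@dataclass(frozen=True)
class UnitDType:
    """Hypothetical unit dtype storing exponents of (metre, second)."""
    metre: int = 0
    second: int = 0

    def resolve_divide(self, other):
        # Dividing quantities subtracts the unit exponents.
        return UnitDType(self.metre - other.metre, self.second - other.second)


result = StringDType(4).resolve_add(StringDType(4))
speed = UnitDType(metre=1).resolve_divide(UnitDType(second=1))
print(result)  # a string dtype of length 8
print(speed)   # metre**1 * second**-1, i.e. a speed
```

The point of the sketch is that the output type depends on both the operation and the parameters, which a fixed table of type numbers cannot express.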
>
> This additional functionality for supporting parametric datatypes
> introduces
> increased complexity within NumPy itself,
> and furthermore is not available to external user-defined datatypes.
> In general the concerns of different datatypes are not well
> encapsulated.
> This burden is exacerbated by the exposure of internal C structures,
> limiting the addition of new fields
> (for example to support new sorting methods [new_sort]_).
>
> Currently there are many factors which limit the creation of new
> user-defined
> datatypes:
>
> * Creating casting rules for parametric user-defined dtypes is either
> impossible
> or so complex that it has never been attempted.
> * Type promotion, e.g. the operation deciding that adding float and
> integer
> values should return a float value, is very valuable for numeric
> datatypes
> but is limited in scope for user-defined and especially parametric
> datatypes.
> * Much of the logic (e.g. promotion) is written in single functions
> instead of being split as methods on the datatype itself.
> * In the current design datatypes cannot have methods that do not
> generalize
> to other datatypes. For example a unit datatype cannot have a
> ``.to_si()`` method to
> easily find the datatype which would represent the same values in
> SI units.
>
> The large need to solve these issues has driven the scientific
> community
> to create work-arounds in multiple projects implementing physical
> units as an
> array-like class instead of a datatype, which would generalize better
> across
> multiple array-likes (Dask, pandas, etc.).
> Already, Pandas has made a push in the same direction with its
> extension arrays [pandas_extension_arrays]_ and undoubtedly
> the community would be best served if such new features could be
> common
> between NumPy, Pandas, and other projects.
>
> Scope
> ^^^^^
>
> The proposed refactoring of the datatype system is a large
> undertaking and
> thus is proposed to be split into various phases, roughly:
>
> * Phase I: Restructure and extend the datatype infrastructure (This
> NEP 41)
> * Phase II: Incrementally define or rework API (Detailed largely in
> NEPs 42/43)
> * Phase III: Growth of NumPy and Scientific Python Ecosystem
> capabilities.
>
> For a more detailed accounting of the various phases, see
> "Plan to Approach the Full Refactor" in the Implementation section
> below.
> This NEP proposes to move ahead with the necessary creation of new
> dtype
> subclasses (Phase I),
> and start working on implementing current functionality.
> Within the context of this NEP all development will be fully private
> API or
> use preliminary underscored names which must be changed in the
> future.
> Most of the internal and public API choices are part of a second
> Phase
> and will be discussed in more detail in the following NEPs 42 and 43.
> The initial implementation of this NEP will have little or no effect
> on users,
> but provides the necessary ground work for incrementally addressing
> the
> full rework.
>
> The implementation of this NEP and the following, implied large
> rework of how
> datatypes are defined in NumPy is expected to create small
> incompatibilities
> (see backward compatibility section).
> However, a transition requiring large code adaptation is not
> anticipated and not within scope.
>
> Specifically, this NEP makes the following design choices, which are
> discussed in more detail in the detailed description section:
>
> 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> with most of the
> datatype-specific logic being implemented
> as special methods on the class. In the C-API, these correspond to
> specific
> slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f,
> np.dtype)`` will remain true,
> but ``type(f)`` will be a subclass of ``np.dtype`` rather than
> just ``np.dtype`` itself.
> The ``PyArray_ArrFuncs`` which are currently stored as a pointer
> on the instance (as ``PyArray_Descr->f``),
> should instead be stored on the class as typically done in Python.
> In the future these may correspond to Python-side dunder methods.
> Storage information such as itemsize and byteorder can differ
> between
> different dtype instances (e.g. "S3" vs. "S8") and will remain
> part of the instance.
> This means that in the long run the current low-level access to
> dtype methods
> will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
>
> 2. The current NumPy scalars will *not* change; they will not be
> instances of datatypes. This will also be true for new datatypes:
> scalars will not be
> instances of a dtype (although ``isinstance(scalar, dtype)`` may
> be made
> to return ``True`` when appropriate).
>
> Detailed technical decisions to follow in NEP 42.
>
> Further, the public API will be designed in a way that is extensible
> in the future:
>
> 3. All new C-API functions provided to the user will hide
> implementation details
> as much as possible. The public API should be an identical, but
> limited,
> version of the C-API used for the internal NumPy datatypes.
>
> The changes to the datatype system in Phase II must include a large
> refactor of the
> UFunc machinery, which will be further defined in NEP 43:
>
> 4. To enable all of the desired functionality for new user-defined
> datatypes,
> the UFunc machinery will be changed to replace the current
> dispatching
> and type resolution system.
> The old system should be *mostly* supported as a legacy version
> for some time.
>
> Additionally, as a general design principle, the addition of new
> user-defined
> datatypes will *not* change the behaviour of programs.
> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> ``b`` know
> that ``c`` exists.
>
>
> User Impact
> -----------
>
> The current ecosystem has very few user-defined datatypes using
> NumPy, the
> two most prominent being: ``rational`` and ``quaternion``.
> These represent fairly simple datatypes which are not strongly
> impacted
> by the current limitations.
> However, we have identified a need for datatypes such as:
>
> * bfloat16, used in deep learning
> * categorical types
> * physical units (such as meters)
> * datatypes for tracing/automatic differentiation
> * high, fixed precision math
> * specialized integer types such as int2, int24
> * new, better datetime representations
> * extending e.g. integer dtypes to have a sentinel NA value
> * geometrical objects [pygeos]_
>
> Some of these are partially solved; for example unit capability is
> provided
> in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray`
> subclasses.
> Most of these datatypes, however, simply cannot be reasonably defined
> right now.
> An advantage of having such datatypes in NumPy is that they should
> integrate
> seamlessly with other array or array-like packages such as Pandas,
> ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
>
> The long term user impact of implementing this NEP will be to allow
> both
> the growth of the whole ecosystem by having such new datatypes, as
> well as
> consolidating implementation of such datatypes within NumPy to
> achieve
> better interoperability.
>
>
> Examples
> ^^^^^^^^
>
> The following examples represent future user-defined datatypes we
> wish to enable.
> These datatypes are not part of the NEP and choices (e.g. choice of
> casting rules)
> are possibilities we wish to enable and do not represent
> recommendations.
>
> Simple Numerical Types
> """"""""""""""""""""""
>
> Mainly used where memory is a consideration, lower-precision numeric
> types
> such as `bfloat16
> <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> are common in other computational frameworks.
> For these types the definitions of things such as ``np.common_type``
> and
> ``np.can_cast`` are some of the most important interfaces. Once they
> support ``np.common_type``, it is (for the most part) possible to
> find
> the correct ufunc loop to call, since most ufuncs -- such as add --
> effectively
> only require ``np.result_type``::
>
> >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
>
> and `~numpy.result_type` is largely identical to
> `~numpy.common_type`.
>
>
> Fixed, high precision math
> """"""""""""""""""""""""""
>
> Allowing arbitrary precision or higher precision math is important in
> simulations. For instance ``mpmath`` defines a precision::
>
> >>> import mpmath as mp
> >>> print(mp.mp.dps) # the current (default) precision
> 15
>
> NumPy should be able to construct a native, memory-efficient array
> from
> a list of ``mpmath.mpf`` floating point objects::
>
> >>> arr_15_dps = np.array(mp.arange(3)) # (mp.arange returns a list)
> >>> print(arr_15_dps) # Must find the correct precision from the objects:
> array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
>
> We should also be able to specify the desired precision when
> creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``
> to find the DType class (the notation is not part of this NEP),
> which is then instantiated with the desired parameter.
> This could also be written as ``MpfDType`` class::
>
> >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
> >>> print(arr_15_dps + arr_100_dps)
> array(['1.0', '3.0', '5.0'], dtype=mpf[dps=100])
>
> The ``mpf`` datatype can decide that the result of the operation
> should be the
> higher precision one of the two, so uses a precision of 100.
> Furthermore, we should be able to define casting, for example as in::
>
> >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
> True
> >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
> False # loses precision
> >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
> True
>
> Casting from float is probably always at least a ``same_kind`` cast,
> but in general, it is not safe::
>
> >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
> False
>
> since a float64 has a higher precision than the ``mpf`` datatype with
> ``dps=4``.
>
> Alternatively, we can say that::
>
> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
> np.dtype[mp.mpf](dps=10)
>
> And possibly even::
>
> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> np.dtype[mp.mpf](dps=16) # equivalent precision to float64 (I believe)
>
> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> safely.
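
The casting and promotion rules from the ``mpf`` example can be written down as a small pure-Python sketch. ``MpfDType``, ``can_cast_safely`` and ``common_dtype`` are hypothetical names, and treating float64 as equivalent to ``dps=16`` is the same unverified assumption as in the example above.

```python
from dataclasses import dataclass

# Assumption carried over from the example above: float64 is treated as
# roughly equivalent to 16 decimal digits of precision.
FLOAT64_DPS = 16


@dataclass(frozen=True)
class MpfDType:
    """Hypothetical parametric dtype; `dps` is the decimal precision."""
    dps: int


def can_cast_safely(from_dtype, to_dtype):
    # A cast is "safe" only if the target keeps at least as many digits.
    return from_dtype.dps <= to_dtype.dps


def common_dtype(a, b):
    # Promotion picks the higher precision of the two inputs.
    return MpfDType(max(a.dps, b.dps))


float64_as_mpf = MpfDType(FLOAT64_DPS)
print(can_cast_safely(MpfDType(15), MpfDType(100)))
print(can_cast_safely(MpfDType(100), MpfDType(15)))
print(common_dtype(MpfDType(5), float64_as_mpf))
```

In this model the promotion result is always a safe cast target for both inputs, which is the property the examples above rely on.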
>
>
> Categoricals
> """"""""""""
>
> Categoricals are interesting in that they can have fixed, predefined
> values,
> or can be dynamic with the ability to modify categories when
> necessary.
> Fixed categories (defined ahead of time) are the most straightforward
> categorical definition.
> Categoricals are *hard*, since there are many strategies to implement
> them,
> suggesting NumPy should only provide the scaffolding for user-defined
> categorical types. For instance::
>
> >>> cat = Categorical(["eggs", "spam", "toast"])
> >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
>
> could store the array very efficiently, since it knows that there are
> only 3
> categories.
> Since a categorical in this sense knows almost nothing about the data
> stored in it, few operations make sense, although equality does::
>
> >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
> >>> breakfast == breakfast2
> array([True, False, True, False])
>
> The categorical datatype could work like a dictionary: no two
> item names can be equal (checked on dtype creation), so that the
> equality operation above can be performed very efficiently.
> If the values define an order, the category labels (internally
> integers) could
> be ordered the same way to allow efficient sorting and comparison.
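
The dictionary-like behaviour described above might look roughly like this in plain Python. This is a sketch with invented names (the ``Categorical`` class and its ``encode`` method), storing plain lists of integer codes rather than a real array:

```python
class Categorical:
    """Hypothetical fixed-category dtype mapping labels to integer codes."""

    def __init__(self, categories):
        # Checked on dtype creation: no two category names may be equal.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = tuple(categories)
        self._codes = {name: code for code, name in enumerate(self.categories)}

    def encode(self, values):
        # Only the small integer codes are stored, not the labels.
        return [self._codes[value] for value in values]


cat = Categorical(["eggs", "spam", "toast"])
breakfast = cat.encode(["eggs", "spam", "eggs", "toast"])
breakfast2 = cat.encode(["eggs", "eggs", "eggs", "eggs"])

# Equality reduces to comparing integer codes.
equal = [a == b for a, b in zip(breakfast, breakfast2)]
print(breakfast, breakfast2, equal)
```

Because the codes follow the order in which the categories were given, sorting the codes also sorts by category order.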
>
> Whether or not casting is defined from one categorical with fewer
> values to one with strictly more values is something that the
> Categorical datatype would need to decide. Both options should be
> available.
>
>
> Unit on the Datatype
> """"""""""""""""""""
>
> There are different ways to define Units, depending on how the
> internal machinery would be organized. One way is to have a single
> Unit datatype for every existing numerical type.
> This will be written as ``Unit[float64]``; the unit itself is part of
> the DType instance, so ``Unit[float64]("m")`` is a ``float64`` with
> meters attached::
>
> >>> from astropy import units
> >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m # meters
> >>> print(meters)
> array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> Note that units are a bit tricky. It is debatable whether::
>
> >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> should be valid syntax (coercing the float scalars without a unit to
> meters).
> Once the array is created, math will work without any issue::
>
> >>> meters / (2 * units.s)
> array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
>
> Casting is not valid from one unit to the other, but can be valid
> between
> different scales of the same dimensionality (although this may be
> "unsafe")::
>
> >>> meters.astype(Unit[float64]("s"))
> TypeError: Cannot cast meters to seconds.
> >>> meters.astype(Unit[float64]("km"))
> >>> # Convert to centimeter-gram-second (cgs) units:
> >>> meters.astype(meters.dtype.to_cgs())
>
> The above notation is somewhat clumsy. Functions
> could be used instead to convert between units.
> There may be ways to make these more convenient, but those must be
> left
> for future discussions::
>
> >>> units.convert(meters, "km")
> >>> units.to_cgs(meters)
>
> There are some open questions. For example, whether additional
> methods
> on the array object could exist to simplify some of the notions, and
> how these
> would percolate from the datatype to the ``ndarray``.
>
> The interaction with other scalars would likely be defined through::
>
> >>> np.common_type(np.float64, Unit)
> Unit[np.float64](dimensionless)
>
> Ufunc output datatype determination can be more involved than for
> simple
> numerical dtypes since there is no "universal" output type::
>
> >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
>
> In fact ``np.result_type(meters, seconds)`` must error without
> context of the operation being done.
> This example highlights how the specific ufunc loop
> (loop with known, specific DTypes as inputs) has to be able to make
> certain decisions before the actual calculation can start.
>
>
>
> Implementation
> --------------
>
> Plan to Approach the Full Refactor
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> To address these issues in NumPy and enable new datatypes,
> multiple development stages are required:
>
> * Phase I: Restructure and extend the datatype infrastructure (This
> NEP)
>
> * Organize Datatypes like normal Python classes [`PR 15508`]_
>
> * Phase II: Incrementally define or rework API
>
> * Create a new and easily extensible API for defining new datatypes
> and related functionality. (NEP 42)
>
> * Incrementally define all necessary functionality through the new
> API (NEP 42):
>
> * Defining operations such as ``np.common_type``.
> * Allowing casting between datatypes to be defined.
> * Add functionality necessary to create a numpy array from Python
> scalars
> (i.e. ``np.array(...)``).
> * …
>
> * Restructure how universal functions work (NEP 43), in order to:
>
> * make it possible to allow a `~numpy.ufunc` such as ``np.add``
> to be
> extended by user-defined datatypes such as Units.
>
> * allow efficient lookup for the correct implementation for
> user-defined datatypes.
>
> * enable reuse of existing code. Units should be able to use the
> normal math loops and add additional logic to determine output
> type.
>
> * Phase III: Growth of NumPy and Scientific Python Ecosystem
> capabilities:
>
> * Cleanup of legacy behaviour where it is considered buggy or
> undesirable.
> * Provide a path to define new datatypes from Python.
> * Assist the community in creating types such as Units or
> Categoricals
> * Allow strings to be used in functions such as ``np.equal`` or
> ``np.add``.
> * Remove legacy code paths within NumPy to improve long term
> maintainability
>
> This document serves as a basis for phase I and provides the vision
> and
> motivation for the full project.
> Phase I does not introduce any new user-facing features,
> but is concerned with the necessary conceptual cleanup of the current
> datatype system.
> It provides a more "pythonic" datatype Python type object, with a
> clear class hierarchy.
>
> The second phase is the incremental creation of all APIs necessary to
> define
> fully featured datatypes and reorganization of the NumPy datatype
> system.
> This phase will thus be primarily concerned with defining an,
> initially preliminary, stable public API.
>
> Some of the benefits of a large refactor may only become evident
> after the full
> deprecation of the current legacy implementation (i.e. larger code
> removals).
> However, these steps are necessary for improvements to many parts of
> the
> core NumPy API, and are expected to make the implementation generally
> easier to understand.
>
> The following figure illustrates the proposed design at a high level,
> and roughly delineates the components of the overall design.
> Note that this NEP only regards Phase I (shaded area);
> the rest encompasses Phase II, and those design choices are up for
> discussion. However, it highlights that the DType datatype class is
> the central, necessary concept:
>
> .. image:: _static/nep-0041-mindmap.svg
>
>
> First steps directly related to this NEP
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The changes required to NumPy are large and touch many areas
> of the code base,
> but many of these changes can be addressed incrementally.
>
> To enable an incremental approach we will start by creating a
> C-defined
> ``PyArray_DTypeMeta`` class with its instances being the ``DType``
> classes,
> subclasses of ``np.dtype``.
> This is necessary to add the ability of storing custom slots on the
> DType in C.
> This ``DTypeMeta`` will be implemented first to then enable
> incremental
> restructuring of current code.
>
> The addition of ``DType`` will then enable addressing other changes
> incrementally, some of which may begin before settling the full
> internal API:
>
> 1. New machinery for array coercion, with the goal of enabling user
> DTypes
> with appropriate class methods.
> 2. The replacement or wrapping of the current casting machinery.
> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
> into
> DType method slots.
>
> At this point, no or only very limited new public API will be added
> and
> the internal API is considered to be in flux.
> Any new public API may be set up to give warnings and will have leading
> underscores
> to indicate that it is not finalized and can be changed without
> warning.
>
>
> Backward compatibility
> ----------------------
>
> While the actual backward compatibility impact of implementing Phase
> I and II is not yet fully clear, we anticipate and accept the
> following changes:
>
> * **Python API**:
>
> * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
> while right
> now ``type(np.dtype("f8")) is np.dtype``.
> Code should use ``isinstance`` checks, and in very rare cases may
> have to
> be adapted to use it.
>
> * **C-API**:
>
> * In old versions of NumPy ``PyArray_DescrCheck`` is a macro
> which uses
> ``type(dtype) is np.dtype``. When compiling against an old
> NumPy version,
> the macro may have to be replaced with the corresponding
> ``PyObject_IsInstance`` call. (If this is a problem, we could
> backport
> fixing the macro)
>
> * The UFunc machinery changes will break *limited* parts of the
> current
> implementation. Replacing e.g. the default ``TypeResolver`` is
> expected
> to remain supported for a time, although optimized masked inner
> loop iteration
> (which is not even used *within* NumPy) will no longer be
> supported.
>
> * All functions currently defined on the dtypes, such as
> ``PyArray_Descr->f->nonzero``, will be defined and accessed
> differently.
> This means that in the long run low-level access code will
> have to be changed to use the new API. Such changes are expected
> to be necessary in very few projects.
>
> * **dtype implementors (C-API)**:
>
> * The array which is currently provided to some functions (such as
> cast functions),
> will no longer be provided.
> For example ``PyArray_Descr->f->nonzero`` or
> ``PyArray_Descr->f->copyswapn``
> may instead receive a dummy array object with only some fields
> (mainly the dtype) being valid.
> At least in some code paths, a similar mechanism is already used.
>
> * The ``scalarkind`` slot and registration of scalar casting will
> be
> removed/ignored without replacement.
> It currently allows partial value-based casting.
> The ``PyArray_ScalarKind`` function will continue to work for
> builtin types,
> but will not be used internally and be deprecated.
>
> * Currently user dtypes are defined as instances of ``np.dtype``.
> The creation works by the user providing a prototype instance.
> NumPy will need to modify at least the type during registration.
> This has no effect for either ``rational`` or ``quaternion`` and
> mutation
> of the structure seems unlikely after registration.
>
> Since there is a fairly large API surface concerning datatypes,
> further changes, or the limitation of certain functions to currently
> existing datatypes, are likely to occur.
> For example functions which use the type number as input
> should be replaced with functions taking DType classes instead.
> Although public, large parts of this C-API seem to be used rarely,
> possibly never, by downstream projects.
>
>
>
> Detailed Description
> --------------------
>
> This section details the design decisions covered by this NEP.
> The subsections correspond to the list of design choices presented
> in the Scope section.
>
> Datatypes as Python Classes (1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current NumPy datatypes are not full-scale Python classes.
> They are instead (prototype) instances of a single ``np.dtype``
> class.
> Changing this means that any special handling, e.g. for ``datetime``
> can be moved to the Datetime DType class instead, away from
> monolithic general
> code (e.g. current ``PyArray_AdjustFlexibleDType``).
>
> The main consequence of this change with respect to the API is that
> special methods move from the dtype instances to methods on the new
> DType class.
> This is the typical design pattern used in Python.
> Organizing these methods and information in a more Pythonic way
> provides a
> solid foundation for refining and extending the API in the future.
> The current API cannot be extended due to how it is exposed
> publicly.
> This means for example that the methods currently stored in
> ``PyArray_ArrFuncs``
> on each datatype (see NEP 40) will be defined differently in the
> future and
> deprecated in the long run.
>
> The most prominent visible side effect of this will be that
> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> Instead it will be a subclass of ``np.dtype`` meaning that
> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> This will also add the ability to use ``isinstance(dtype,
> np.dtype[float64])``
> thus removing the need to use ``dtype.kind``, ``dtype.char``, or
> ``dtype.type``
> to do this check.
>
> With the design decision of DTypes as full-scale Python classes,
> the question of subclassing arises.
> Inheritance, however, appears problematic and a complexity best
> avoided
> (at least initially) for container datatypes.
> Further, subclasses may be more interesting for interoperability for
> example with GPU backends (CuPy) storing additional methods related
> to the
> GPU rather than as a mechanism to define new datatypes.
> A class hierarchy does provide value; this may be achieved by
> allowing the creation of *abstract* datatypes.
> An example for an abstract datatype would be the datatype equivalent
> of
> ``np.floating``, representing any floating point number.
> These can serve the same purpose as Python's abstract base classes.
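
As a rough illustration of the abstract datatype idea, the following sketch uses Python's own ``abc`` module. The class names (``DType``, ``Floating``, ``Float32``, ``Float64``, ``Int64``) and the helper ``is_floating`` are hypothetical and only model the hierarchy; nothing here is proposed API.

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Hypothetical base class; concrete DTypes must define an itemsize."""

    @property
    @abstractmethod
    def itemsize(self):
        ...


class Floating(DType):
    """Abstract: stands for any floating point datatype (cf. np.floating)."""


class Float32(Floating):
    itemsize = 4


class Float64(Floating):
    itemsize = 8


class Int64(DType):
    itemsize = 8


def is_floating(dt):
    # An isinstance check against the abstract class replaces a check
    # on something like `dtype.kind`.
    return isinstance(dt, Floating)


print(is_floating(Float64()), is_floating(Int64()))
```

The abstract ``Floating`` class cannot be instantiated itself (doing so raises ``TypeError``), so it only ever serves as a grouping for checks, not as a container datatype.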
>
>
> Scalars should not be instances of the datatypes (2)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> For simple datatypes such as ``float64`` (see also below), it seems
> tempting that the instance of a ``np.dtype("float64")`` can be the
> scalar.
> This idea may be even more appealing due to the fact that scalars,
> rather than datatypes, currently define a useful type hierarchy.
>
> However, we have specifically decided against this for a number of
> reasons.
> First, the new datatypes described herein would be instances of DType
> classes.
> Making these instances themselves classes, while possible, adds
> additional
> complexity that users need to understand.
> It would also mean that scalars must have storage information (such
> as byteorder)
> which is generally unnecessary and currently is not used.
> Second, while the simple NumPy scalars such as ``float64`` may be
> such instances,
> it should be possible to create datatypes for Python objects without
> enforcing
> NumPy as a dependency.
> However, Python objects that do not depend on NumPy cannot be
> instances of a NumPy DType.
> Third, there is a mismatch between the methods and attributes which
> are useful
> for scalars and datatypes. For instance ``to_float()`` makes sense
> for a scalar
> but not for a datatype and ``newbyteorder`` is not useful on a scalar
> (or has
> a different meaning).
>
> Overall, it seems that rather than reducing the complexity, i.e. by
> merging the two distinct type hierarchies, making scalars instances of
> DTypes would increase the complexity of both the design and
> implementation.
>
> A possible future path may be to instead simplify the current NumPy
> scalars to be much simpler objects which largely derive their behaviour
> from the datatypes.
>
> C-API for creating new Datatypes (3)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current C-API with which users can create new datatypes
> is limited in scope, and requires use of "private" structures. This means
> the API is not extensible: no new members can be added to the structure
> without losing binary compatibility.
> This has already limited the inclusion of new sorting methods into
> NumPy [new_sort]_.
>
> The new version shall thus replace the current ``PyArray_ArrFuncs``
> structure used to define new datatypes.
> Datatypes that currently exist and are defined using these slots will be
> supported during a deprecation period.
>
> The most likely solution to hide the implementation from the user, and
> thus make it extensible in the future, is to model the API after Python's
> stable API [PEP-384]_:
>
> .. code-block:: C
>
>     static struct PyArrayMethodDef slots[] = {
>         {NPY_dt_method, method_implementation},
>         ...,
>         {0, NULL}  /* sentinel terminating the slot list */
>     };
>
>     typedef struct{
>       PyTypeObject *typeobj;  /* type of the Python scalar */
>       ...;
>       PyType_Slot *slots;
>     } PyArrayDTypeMeta_Spec;
>
>     PyObject* PyArray_InitDTypeMetaFromSpec(
>             PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec *dtype_spec);
>
> The C-side slots should be designed to mirror Python side methods
> such as ``dtype.__dtype_method__``, although the exposure to Python is
> a later step in the implementation to reduce the complexity of the
> initial implementation.
>
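To show why a slot-based spec stays extensible, here is a small Python model of the same pattern. It is a sketch only: the slot identifiers, ``Slot``, ``DTypeSpec`` and ``init_dtype_from_spec`` are hypothetical names invented for this example and do not correspond to any existing NumPy API. The point is that callers only hand over ``(id, function)`` pairs, so new slot ids can be added later without changing any structure that user code depends on.

```python
from typing import Callable, NamedTuple

# Hypothetical slot identifiers; a later version could append more ids
# without affecting code written against the older ones.
DT_GETITEM = 1
DT_SETITEM = 2
KNOWN_SLOTS = {DT_GETITEM: "getitem", DT_SETITEM: "setitem"}


class Slot(NamedTuple):
    slot_id: int
    func: Callable


class DTypeSpec(NamedTuple):
    scalar_type: type  # type of the Python scalar
    slots: list        # list of Slot entries


def init_dtype_from_spec(spec):
    """Copy the known slots into a private table the caller never sees."""
    table = {name: None for name in KNOWN_SLOTS.values()}
    for slot in spec.slots:
        if slot.slot_id not in KNOWN_SLOTS:
            raise ValueError(f"unknown slot id {slot.slot_id}")
        table[KNOWN_SLOTS[slot.slot_id]] = slot.func
    return {"scalar_type": spec.scalar_type, "methods": table}


# Usage: a spec that fills only one of the available slots.
spec = DTypeSpec(float, [Slot(DT_GETITEM, lambda raw: float(raw))])
dtype = init_dtype_from_spec(spec)
```

Slots that the spec does not provide simply stay ``None`` in the private table, which is where defaults could be filled in.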
>
> C-API Changes to the UFunc Machinery (4)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Proposed changes to the UFunc machinery will be part of NEP 43.
> However, the following changes will be necessary (see NEP 40 for a
> detailed description of the current implementation and its issues):
>
> * The current UFunc type resolution must be adapted to allow better
>   control for user-defined dtypes as well as resolve current
>   inconsistencies.
> * The inner-loop used in UFuncs must be expanded to include a return
>   value.
>   Further, error reporting must be improved, and passing in
>   dtype-specific information enabled.
>   This requires the modification of the inner-loop function signature
>   and addition of new hooks called before and after the inner-loop is
>   used.
>
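The second point can be sketched in Python as follows. This is an assumption-laden illustration, not the signature NEP 43 will define: ``add_loop``, ``call_loop`` and the ``context`` dictionary are hypothetical. It shows an inner loop that reports failure through its return value, receives dtype-specific information through a context argument, and is wrapped by hooks that run before and after it.

```python
def add_loop(context, inputs, out):
    """Hypothetical inner loop: returns 0 on success, -1 on failure."""
    a, b = inputs
    if len(a) != len(b):
        context["error"] = "length mismatch"
        return -1
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return 0


def call_loop(loop, context, inputs, out, before=None, after=None):
    """Run the optional hooks around the loop and turn a status into an error."""
    if before is not None:
        before(context)
    try:
        status = loop(context, inputs, out)
    finally:
        # The "after" hook runs even when the loop itself raises.
        if after is not None:
            after(context)
    if status != 0:
        raise RuntimeError(context.get("error", "inner loop failed"))
    return out


# Usage: record when the hooks fire around a successful call.
events = []
result = call_loop(
    add_loop,
    {"dtype": "float64"},
    ([1.0, 2.0], [3.0, 4.0]),
    [0.0, 0.0],
    before=lambda ctx: events.append("before"),
    after=lambda ctx: events.append("after"),
)
```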
> An important goal for any changes to the universal functions will be to
> allow the reuse of existing loops.
> It should be easy for a new units datatype to fall back to existing math
> functions after handling the unit-related computations.
>
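A toy version of that fallback, written in plain Python: ``add_floats`` stands in for an existing floating point loop, and ``add_with_units`` together with the ``TO_METERS`` table is a hypothetical units layer made up for this example. The units layer only resolves the conversion factor and then delegates the arithmetic to the existing loop.

```python
# Hypothetical conversion table for a toy length-units datatype.
TO_METERS = {"m": 1.0, "cm": 0.01, "km": 1000.0}


def add_floats(a, b):
    """Stand-in for an existing float addition loop."""
    return [x + y for x, y in zip(a, b)]


def add_with_units(a, unit_a, b, unit_b):
    """Handle the units first, then reuse the plain float loop.

    The result is expressed in the unit of the first operand.
    """
    factor = TO_METERS[unit_b] / TO_METERS[unit_a]
    b_converted = [y * factor for y in b]
    return add_floats(a, b_converted), unit_a


# Usage: add kilometres to metres, result in metres.
values, unit = add_with_units([1.0, 2.0], "m", [1.0, 0.5], "km")
```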
>
> Discussion
> ----------
>
> See NEP 40 for a list of previous meetings and discussions.
>
>
> References
> ----------
>
> .. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
>
> .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
>
> .. [pygeos] https://github.com/caspervdw/pygeos
>
> .. [new_sort] https://github.com/numpy/numpy/pull/12945
>
> .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
>
> .. [PR 15508] https://github.com/numpy/numpy/pull/15508
>
>
> Copyright
> ---------
>
> This document has been placed in the public domain.
>
>
> Acknowledgments
> ---------------
>
> The effort to create new datatypes for NumPy has been discussed for
> several years in many different contexts and settings, making it
> impossible to list everyone involved.
> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
> Eric Wieser for repeated in-depth discussion about datatype design.
> We are very grateful for the community input in reviewing and revising
> this NEP and would like to thank especially Ross Barnowski and Ralf
> Gommers.
>