<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg <<a href="mailto:sebastian@sipsolutions.net">sebastian@sipsolutions.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi all,<br>

<br>

in the spirit of trying to keep this moving, can I assume that the main<br>

reason for little discussion is that the actual changes proposed are<br>

not very far reaching as of now?  Or is the reason that this is a<br>

fairly complex topic that you need more time to think about it?<br></blockquote><div><br></div><div>Probably (a) it's a long NEP on a complex topic, (b) the past week has been a very weird week for everyone (in the extra-news-reading-time I could easily have re-reviewed the NEP), and (c) the amount of feedback one expects to get on a NEP is roughly inversely proportional to the scope and complexity of the NEP contents.</div><div><br></div><div>Today I re-read the parts I commented on before. This version is a big improvement over the previous ones. Thanks in particular for adding clear examples and the diagram, it helps a lot.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

If it is the latter, is there some way I can help with it?  I tried to<br>

minimize how much is part of this initial NEP.<br>

<br>

If there is not much need for discussion, I would like to officially<br>

accept the NEP very soon, sending out an official one week notice in<br>

the next days.<br></blockquote><div><br></div><div>I agree. I think I would like to keep the option open though to come back to the NEP later to improve the clarity of the text about motivation/plan/examples/scope, given that this will be the reference for a major amount of work for a long time to come.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


To summarize one more time, the main point is that:<br></blockquote><div><br></div><div>This point seems fine, and I'm +1 for going ahead with the described parts of the technical design.</div><div><br></div><div>Cheers,</div><div>Ralf</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

    type(np.dtype(np.float64))<br>

<br>

will be `np.dtype[float64]`, a subclass of dtype, so that:<br>

<br>

    issubclass(np.dtype[float64], np.dtype)<br>

<br>

is true. This means that we will have one class for every current type<br>

number: `dtype.num`. The implementation of these subclasses will be a<br>

C-written (extension) MetaClass; all details of this class are supposed<br>

to remain experimental and in flux at this time.<br>

<br>

Cheers<br>

<br>

Sebastian<br>

<br>

<br>

On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:<br>

> Hi all,<br>

> <br>

> I am pleased to propose NEP 41: First step towards a new Datatype<br>

> System <a href="https://numpy.org/neps/nep-0041-improved-dtype-support.html" rel="noreferrer" target="_blank">https://numpy.org/neps/nep-0041-improved-dtype-support.html</a><br>

> <br>

> This NEP motivates the larger restructure of the datatype machinery<br>

> in<br>

> NumPy and defines a few fundamental design aspects. The long term<br>

> user<br>

> impact will be allowing easier and more feature-rich user-defined<br>

> datatypes.<br>

> <br>

> As this is a large restructure, the NEP represents only the first<br>

> steps<br>

> with some additional information in further NEPs being drafted [1]<br>

> (this may be helpful to look at depending on the level of detail you<br>

> are interested in).<br>

> The NEP itself does not propose to add significant new public API.<br>

> Instead it proposes to move forward with an incremental internal<br>

> refactor and lays the foundation for this process.<br>

> <br>

> The main user facing change at this time is that datatypes will<br>

> become<br>

> classes (e.g. ``type(np.dtype("float64"))`` will be a float64<br>

> specific<br>

> class).<br>

> For most users, the main impact should be many new datatypes in the<br>

> long run (see the user impact section). However, for those interested<br>

> in API design within NumPy or with respect to implementing new<br>

> datatypes, this and the following NEPs are important decisions in the<br>

> future roadmap for NumPy.<br>

> <br>

> The current full text is reproduced below, although the above link is<br>

> probably a better way to read it.<br>

> <br>

> Cheers<br>

> <br>

> Sebastian<br>

> <br>

> <br>

> [1] NEP 40 gives some background information about the current<br>

> systems<br>

> and issues with it:<br>

> <a href="https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst</a><br>

> and NEP 42 being a first draft of what the new API may look like:<br>

> <br>

> <a href="https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst</a><br>

> (links to current rendered versions, check <br>

> <a href="https://github.com/numpy/numpy/pull/15505" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/15505</a> and <br>

> <a href="https://github.com/numpy/numpy/pull/15507" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/15507</a> for updates)<br>

> <br>

> <br>

> -------------------------------------------------------------------<br>

> ---<br>

> <br>

> <br>

> =================================================<br>

> NEP 41 — First step towards a new Datatype System<br>

> =================================================<br>

> <br>

> :title: Improved Datatype Support<br>

> :Author: Sebastian Berg<br>

> :Author: Stéfan van der Walt<br>

> :Author: Matti Picus<br>

> :Status: Draft<br>

> :Type: Standard Track<br>

> :Created: 2020-02-03<br>

> <br>

> <br>

> .. note::<br>

> <br>

>     This NEP is part of a series of NEPs encompassing first<br>

> information<br>

>     about the previous dtype implementation and issues with it in NEP<br>

> 40.<br>

>     NEP 41 (this document) then provides an overview and generic<br>

> design<br>

>     choices for the refactor.<br>

>     Further NEPs 42 and 43 go into the technical details of the<br>

> datatype<br>

>     and universal function related internal and external API changes.<br>

>     In some cases it may be necessary to consult the other NEPs for a<br>

> full<br>

>     picture of the desired changes and why these changes are<br>

> necessary.<br>

> <br>

> <br>

> Abstract<br>

> --------<br>

> <br>

> `Datatypes <data-type-objects-dtype>` in NumPy describe how to<br>

> interpret each<br>

> element in arrays. NumPy provides ``int``, ``float``, and ``complex``<br>

> numerical<br>

> types, as well as string, datetime, and structured datatype<br>

> capabilities.<br>

> The growing Python community, however, has need for more diverse<br>

> datatypes.<br>

> Examples are datatypes with unit information attached (such as<br>

> meters) or<br>

> categorical datatypes (fixed set of possible values).<br>

> However, the current NumPy datatype API is too limited to allow the<br>

> creation<br>

> of these.<br>

> <br>

> This NEP is the first step to enable such growth; it will lead to<br>

> a simpler development path for new datatypes.<br>

> In the long run the new datatype system will also support the<br>

> creation<br>

> of datatypes directly from Python rather than C.<br>

> Refactoring the datatype API will improve maintainability and<br>

> facilitate<br>

> development of both user-defined external datatypes,<br>

> as well as new features for existing datatypes internal to NumPy.<br>

> <br>

> <br>

> Motivation and Scope<br>

> --------------------<br>

> <br>

> .. seealso::<br>

> <br>

>     The user impact section includes examples of what kind of new<br>

> datatypes<br>

>     will be enabled by the proposed changes in the long run.<br>

>     It may thus help to read these sections out of order.<br>

> <br>

> Motivation<br>

> ^^^^^^^^^^<br>

> <br>

> One of the main issues with the current API is the definition of<br>

> typical<br>

> functions such as addition and multiplication for parametric<br>

> datatypes<br>

> (see also NEP 40) which require additional steps to determine the<br>

> output type.<br>

> For example when adding two strings of length 4, the result is a<br>

> string<br>

> of length 8, which is different from the input.<br>

> Similarly, a datatype which embeds a physical unit must calculate the<br>

> new unit<br>

> information: dividing a distance by a time results in a speed.<br>

> A related difficulty is that the :ref:`current casting rules<br>

> <_ufuncs.casting>`<br>

> -- the conversion between different datatypes --<br>

> cannot describe casting for such parametric datatypes implemented<br>

> outside of NumPy.<br>

> <br>

> This additional functionality for supporting parametric datatypes<br>

> introduces<br>

> increased complexity within NumPy itself,<br>

> and furthermore is not available to external user-defined datatypes.<br>

> In general the concerns of different datatypes are not well<br>

> encapsulated.<br>

> This burden is exacerbated by the exposure of internal C structures,<br>

> limiting the addition of new fields<br>

> (for example to support new sorting methods [new_sort]_).<br>

> <br>

> Currently there are many factors which limit the creation of new<br>

> user-defined<br>

> datatypes:<br>

> <br>

> * Creating casting rules for parametric user-defined dtypes is either<br>

> impossible<br>

>   or so complex that it has never been attempted.<br>

> * Type promotion, e.g. the operation deciding that adding float and<br>

> integer<br>

>   values should return a float value, is very valuable for numeric<br>

> datatypes<br>

>   but is limited in scope for user-defined and especially parametric<br>

> datatypes.<br>

> * Much of the logic (e.g. promotion) is written in single functions<br>

>   instead of being split as methods on the datatype itself.<br>

> * In the current design datatypes cannot have methods that do not<br>

> generalize<br>

>   to other datatypes. For example a unit datatype cannot have a<br>

> ``.to_si()`` method to<br>

>   easily find the datatype which would represent the same values in<br>

> SI units.<br>

> <br>

> The large need to solve these issues has driven the scientific<br>

> community<br>

> to create work-arounds in multiple projects implementing physical<br>

> units as an<br>

> array-like class instead of a datatype, which would generalize better<br>

> across<br>

> multiple array-likes (Dask, pandas, etc.).<br>

> Already, Pandas has made a push in the same direction with its<br>

> extension arrays [pandas_extension_arrays]_ and undoubtedly<br>

> the community would be best served if such new features could be<br>

> common<br>

> between NumPy, Pandas, and other projects.<br>

> <br>

> Scope<br>

> ^^^^^<br>

> <br>

> The proposed refactoring of the datatype system is a large<br>

> undertaking and<br>

> thus is proposed to be split into various phases, roughly:<br>

> <br>

> * Phase I: Restructure and extend the datatype infrastructure (This<br>

> NEP 41)<br>

> * Phase II: Incrementally define or rework API (Detailed largely in<br>

> NEPs 42/43)<br>

> * Phase III: Growth of NumPy and Scientific Python Ecosystem<br>

> capabilities.<br>

> <br>

> For a more detailed accounting of the various phases, see<br>

> "Plan to Approach the Full Refactor" in the Implementation section<br>

> below.<br>

> This NEP proposes to move ahead with the necessary creation of new<br>

> dtype<br>

> subclasses (Phase I),<br>

> and start working on implementing current functionality.<br>

> Within the context of this NEP all development will be fully private<br>

> API or<br>

> use preliminary underscored names which must be changed in the<br>

> future.<br>

> Most of the internal and public API choices are part of a second<br>

> Phase<br>

> and will be discussed in more detail in the following NEPs 42 and 43.<br>

> The initial implementation of this NEP will have little or no effect<br>

> on users,<br>

> but provides the necessary ground work for incrementally addressing<br>

> the<br>

> full rework.<br>

> <br>

> The implementation of this NEP and the following, implied large<br>

> rework of how<br>

> datatypes are defined in NumPy is expected to create small<br>

> incompatibilities<br>

> (see backward compatibility section).<br>

> However, a transition requiring large code adaption is not<br>

> anticipated and not<br>

> within scope.<br>

> <br>

> Specifically, this NEP makes the following design choices which are<br>

> discussed<br>

> in more detail in the detailed description section:<br>

> <br>

> 1. Each datatype will be an instance of a subclass of ``np.dtype``,<br>

> with most of the<br>

>    datatype-specific logic being implemented<br>

>    as special methods on the class. In the C-API, these correspond to<br>

> specific<br>

>    slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f,<br>

> np.dtype)`` will remain true,<br>

>    but ``type(f)`` will be a subclass of ``np.dtype`` rather than<br>

> just ``np.dtype`` itself.<br>

>    The ``PyArray_ArrFuncs`` which are currently stored as a pointer<br>

> on the instance (as ``PyArray_Descr->f``),<br>

>    should instead be stored on the class as typically done in Python.<br>

>    In the future these may correspond to python side dunder methods.<br>

>    Storage information such as itemsize and byteorder can differ<br>

> between<br>

>    different dtype instances (e.g. "S3" vs. "S8") and will remain<br>

> part of the instance.<br>

>    This means that in the long run the current lowlevel access to<br>

> dtype methods<br>

>    will be removed (see ``PyArray_ArrFuncs`` in NEP 40).<br>

> <br>

> 2. The current NumPy scalars will *not* change, they will not be<br>

> instances of<br>

>    datatypes. This will also be true for new datatypes, scalars will<br>

> not be<br>

>    instances of a dtype (although ``isinstance(scalar, dtype)`` may<br>

> be made<br>

>    to return ``True`` when appropriate).<br>

> <br>

> Detailed technical decisions to follow in NEP 42.<br>

> <br>

> Further, the public API will be designed in a way that is extensible<br>

> in the future:<br>

> <br>

> 3. All new C-API functions provided to the user will hide<br>

> implementation details<br>

>    as much as possible. The public API should be an identical, but<br>

> limited,<br>

>    version of the C-API used for the internal NumPy datatypes.<br>

> <br>

> The changes to the datatype system in Phase II must include a large<br>

> refactor of the<br>

> UFunc machinery, which will be further defined in NEP 43:<br>

> <br>

> 4. To enable all of the desired functionality for new user-defined<br>

> datatypes,<br>

>    the UFunc machinery will be changed to replace the current<br>

> dispatching<br>

>    and type resolution system.<br>

>    The old system should be *mostly* supported as a legacy version<br>

> for some time.<br>

> <br>

> Additionally, as a general design principle, the addition of new<br>

> user-defined<br>

> datatypes will *not* change the behaviour of programs.<br>

> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or<br>

> ``b`` know<br>

> that ``c`` exists.<br>
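<br>
A minimal pure-Python sketch of the design principle above, under stated assumptions: the ``__common_dtype__`` hook, the ``common_dtype`` helper and the three toy classes are hypothetical names chosen for illustration and are not NumPy API. It only shows that the result can be restricted to a class that one of the two operands names itself.<br>
<pre>
```python
# Hypothetical sketch: each DType class only knows the classes it names itself.
class DType:
    @classmethod
    def __common_dtype__(cls, other):
        # Default: a class is only "common" with itself.
        return cls if other is cls else NotImplemented


class Float64(DType):
    pass


class Int64(DType):
    @classmethod
    def __common_dtype__(cls, other):
        # Int64 knows about Float64 and defers to it.
        if other is Float64:
            return Float64
        return super().__common_dtype__(other)


class UserUnit(DType):
    # A third-party dtype that neither Int64 nor Float64 knows about.
    pass


def common_dtype(a, b):
    """Ask a, then b; the answer can only be a class one of them names."""
    for first, second in ((a, b), (b, a)):
        result = first.__common_dtype__(second)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no common dtype for {a.__name__} and {b.__name__}")
```
</pre>
With these definitions ``common_dtype(Int64, Float64)`` resolves to ``Float64`` in either argument order, while pairing ``Float64`` with the unrelated ``UserUnit`` raises ``TypeError`` rather than silently picking a third class.<br>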

> <br>

> <br>

> User Impact<br>

> -----------<br>

> <br>

> The current ecosystem has very few user-defined datatypes using<br>

> NumPy, the<br>

> two most prominent being: ``rational`` and ``quaternion``.<br>

> These represent fairly simple datatypes which are not strongly<br>

> impacted<br>

> by the current limitations.<br>

> However, we have identified a need for datatypes such as:<br>

> <br>

> * bfloat16, used in deep learning<br>

> * categorical types<br>

> * physical units (such as meters)<br>

> * datatypes for tracing/automatic differentiation<br>

> * high, fixed precision math<br>

> * specialized integer types such as int2, int24<br>

> * new, better datetime representations<br>

> * extending e.g. integer dtypes to have a sentinel NA value<br>

> * geometrical objects [pygeos]_<br>

> <br>

> Some of these are partially solved; for example unit capability is<br>

> provided<br>

> in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray`<br>

> subclasses.<br>

> Most of these datatypes, however, simply cannot be reasonably defined<br>

> right now.<br>

> An advantage of having such datatypes in NumPy is that they should<br>

> integrate<br>

> seamlessly with other array or array-like packages such as Pandas,<br>

> ``xarray`` [xarray_dtype_issue]_, or ``Dask``.<br>

> <br>

> The long term user impact of implementing this NEP will be to allow<br>

> both<br>

> the growth of the whole ecosystem by having such new datatypes, as<br>

> well as<br>

> consolidating implementation of such datatypes within NumPy to<br>

> achieve<br>

> better interoperability.<br>

> <br>

> <br>

> Examples<br>

> ^^^^^^^^<br>

> <br>

> The following examples represent future user-defined datatypes we<br>

> wish to enable.<br>

> These datatypes are not part of the NEP and choices (e.g. choice of<br>

> casting rules)<br>

> are possibilities we wish to enable and do not represent<br>

> recommendations.<br>

> <br>

> Simple Numerical Types<br>

> """"""""""""""""""""""<br>

> <br>

> Mainly used where memory is a consideration, lower-precision numeric<br>

> types<br>

> such as `bfloat16 <<br>

> <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Bfloat16_floating-point_format</a>>`_<br>

> are common in other computational frameworks.<br>

> For these types the definitions of things such as ``np.common_type``<br>

> and<br>

> ``np.can_cast`` are some of the most important interfaces. Once they<br>

> support ``np.common_type``, it is (for the most part) possible to<br>

> find<br>

> the correct ufunc loop to call, since most ufuncs -- such as add --<br>

> effectively<br>

> only require ``np.result_type``::<br>

> <br>

>     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)<br>

> <br>

> and `~numpy.result_type` is largely identical to<br>

> `~numpy.common_type`.<br>

> <br>

> <br>

> Fixed, high precision math<br>

> """"""""""""""""""""""""""<br>

> <br>

> Allowing arbitrary precision or higher precision math is important in<br>

> simulations. For instance ``mpmath`` defines a precision::<br>

> <br>

>     >>> import mpmath as mp<br>

>     >>> print(mp.mp.dps)  # the current (default) precision<br>

>     15<br>

> <br>

> NumPy should be able to construct a native, memory-efficient array<br>

> from<br>

> a list of ``mpmath.mpf`` floating point objects::<br>

> <br>

>     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a<br>

> list)<br>

>     >>> print(arr_15_dps)  # Must find the correct precision from the<br>

> objects:<br>

>     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])<br>

> <br>

> We should also be able to specify the desired precision when<br>

> creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``<br>

> to find the DType class (the notation is not part of this NEP),<br>

> which is then instantiated with the desired parameter.<br>

> This could also be written as ``MpfDType`` class::<br>

> <br>

>     >>> arr_100_dps = np.array([1, 2, 3],<br>

> dtype=np.dtype[mp.mpf](dps=100))<br>

>     >>> print(arr_15_dps + arr_100_dps)<br>

>     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])<br>

> <br>

> The ``mpf`` datatype can decide that the result of the operation<br>

> should be the<br>

> higher precision one of the two, so uses a precision of 100.<br>

> Furthermore, we should be able to define casting, for example as in::<br>

> <br>

>     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype,<br>

> casting="safe")<br>

>     True<br>

>     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype,<br>

> casting="safe")<br>

>     False  # loses precision<br>

>     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype,<br>

> casting="same_kind")<br>

>     True<br>

> <br>

> Casting from float is probably always at least a ``same_kind``<br>

> cast, but<br>

> in general, it is not safe::<br>

> <br>

>     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4),<br>

> casting="safe")<br>

>     False<br>

> <br>

> since a float64 has a higher precision than the ``mpf`` datatype with<br>

> ``dps=4``.<br>

> <br>

> Alternatively, we can say that::<br>

> <br>

>     >>> np.common_type(np.dtype[mp.mpf](dps=5),<br>

> np.dtype[mp.mpf](dps=10))<br>

>     np.dtype[mp.mpf](dps=10)<br>

> <br>

> And possibly even::<br>

> <br>

>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)<br>

>     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I<br>

> believe)<br>

> <br>

> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``<br>

> safely.<br>
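<br>
The precision rules described above can be sketched in plain Python. This is an illustration only: ``MpfDType``, ``common_mpf`` and ``can_cast_safely`` are hypothetical names, and the float64 equivalent of 16 decimal digits is the assumption stated in the text, not a verified constant.<br>
<pre>
```python
from dataclasses import dataclass

# Assumption taken from the text above: float64 is treated as dps=16.
FLOAT64_DPS = 16


@dataclass(frozen=True)
class MpfDType:
    """Toy parametric dtype: the precision lives on the instance."""
    dps: int


def common_mpf(a, b):
    # The common dtype keeps the higher of the two precisions.
    return MpfDType(dps=max(a.dps, b.dps))


def can_cast_safely(from_dtype, to_dtype):
    # A cast is "safe" only if the target keeps at least as many digits.
    return to_dtype.dps >= from_dtype.dps
```
</pre>
Here ``common_mpf(MpfDType(5), MpfDType(10))`` gives ``MpfDType(dps=10)``, and casting from ``dps=100`` down to ``dps=15`` is reported as unsafe because digits would be lost.<br>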

> <br>

> <br>

> Categoricals<br>

> """"""""""""<br>

> <br>

> Categoricals are interesting in that they can have fixed, predefined<br>

> values,<br>

> or can be dynamic with the ability to modify categories when<br>

> necessary.<br>

> Fixed categories (defined ahead of time) are the most<br>

> straightforward<br>

> categorical definition.<br>

> Categoricals are *hard*, since there are many strategies to implement<br>

> them,<br>

> suggesting NumPy should only provide the scaffolding for user-defined<br>

> categorical types. For instance::<br>

> <br>

>     >>> cat = Categorical(["eggs", "spam", "toast"])<br>

>     >>> breakfast = array(["eggs", "spam", "eggs", "toast"],<br>

> dtype=cat)<br>

> <br>

> could store the array very efficiently, since it knows that there are<br>

> only 3<br>

> categories.<br>

> Since a categorical in this sense knows almost nothing about the data<br>

> stored<br>

> in it, few operations make sense, although equality does::<br>

> <br>

>     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"],<br>

> dtype=cat)<br>

>     >>> breakfast == breakfast2<br>

>     array([True, False, True, False])<br>

> <br>

> The categorical datatype could work like a dictionary: no two<br>

> item names can be equal (checked on dtype creation), so that the<br>

> equality<br>

> operation above can be performed very efficiently.<br>

> If the values define an order, the category labels (internally<br>

> integers) could<br>

> be ordered the same way to allow efficient sorting and comparison.<br>

> <br>

> Whether or not casting is defined from one categorical with fewer to<br>

> one with<br>

> strictly more values defined, is something that the Categorical<br>

> datatype would<br>

> need to decide. Both options should be available.<br>
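<br>
A rough pure-Python sketch of the fixed-category idea described above. The ``Categorical`` class and its ``encode``/``decode`` methods are hypothetical and only model the storage scheme (integer codes plus a lookup table); they say nothing about how a real NumPy DType would be implemented.<br>
<pre>
```python
class Categorical:
    """Toy fixed categorical: values are stored as small integer codes."""

    def __init__(self, categories):
        categories = tuple(categories)
        # Like dictionary keys, category names must be unique.
        if len(set(categories)) != len(categories):
            raise ValueError("category names must be unique")
        self.categories = categories
        self._codes = {name: code for code, name in enumerate(categories)}

    def encode(self, values):
        # Unknown values raise KeyError; the categories are fixed.
        return [self._codes[value] for value in values]

    def decode(self, codes):
        return [self.categories[code] for code in codes]


def equal(codes_a, codes_b):
    # Equality only needs to compare integer codes, never the labels.
    return [a == b for a, b in zip(codes_a, codes_b)]
```
</pre>
Encoding ``["eggs", "spam", "eggs", "toast"]`` with the categories ``["eggs", "spam", "toast"]`` yields the codes ``[0, 1, 0, 2]``, and comparing that against four ``"eggs"`` entries is done on the codes alone.<br>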

> <br>

> <br>

> Unit on the Datatype<br>

> """"""""""""""""""""<br>

> <br>

> There are different ways to define Units, depending on how the<br>

> internal<br>

> machinery would be organized. One way is to have a single Unit<br>

> datatype<br>

> for every existing numerical type.<br>

> This will be written as ``Unit[float64]``. The unit itself is part of<br>

> the<br>

> DType instance: ``Unit[float64]("m")`` is a ``float64`` with meters<br>

> attached::<br>

> <br>

>     >>> from astropy import units<br>

>     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  #<br>

> meters<br>

>     >>> print(meters)<br>

>     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))<br>

> <br>

> Note that units are a bit tricky. It is debatable, whether::<br>

> <br>

>     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))<br>

> <br>

> should be valid syntax (coercing the float scalars without a unit to<br>

> meters).<br>

> Once the array is created, math will work without any issue::<br>

> <br>

>     >>> meters / (2 * units.s)<br>

>     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))<br>

> <br>

> Casting is not valid from one unit to the other, but can be valid<br>

> between<br>

> different scales of the same dimensionality (although this may be<br>

> "unsafe")::<br>

> <br>

>     >>> meters.astype(Unit[float64]("s"))<br>

>     TypeError: Cannot cast meters to seconds.<br>

>     >>> meters.astype(Unit[float64]("km"))<br>

>     >>> # Convert to centimeter-gram-second (cgs) units:<br>

>     >>> meters.astype(meters.dtype.to_cgs())<br>

> <br>

> The above notation is somewhat clumsy. Functions<br>

> could be used instead to convert between units.<br>

> There may be ways to make these more convenient, but those must be<br>

> left<br>

> for future discussions::<br>

> <br>

>     >>> units.convert(meters, "km")<br>

>     >>> units.to_cgs(meters)<br>

> <br>

> There are some open questions. For example, whether additional<br>

> methods<br>

> on the array object could exist to simplify some of the notions, and<br>

> how these<br>

> would percolate from the datatype to the ``ndarray``.<br>

> <br>

> The interaction with other scalars would likely be defined through::<br>

> <br>

>     >>> np.common_type(np.float64, Unit)<br>

>     Unit[np.float64](dimensionless)<br>

> <br>

> Ufunc output datatype determination can be more involved than for<br>

> simple<br>

> numerical dtypes since there is no "universal" output type::<br>

> <br>

>     >>> np.multiply(meters, seconds).dtype != np.result_type(meters,<br>

> seconds)<br>

> <br>

> In fact ``np.result_type(meters, seconds)`` must error without<br>

> context<br>

> of the operation being done.<br>

> This example highlights how the specific ufunc loop<br>

> (loop with known, specific DTypes as inputs) has to be able to<br>

> make<br>

> certain decisions before the actual calculation can start.<br>
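<br>
To make the point above concrete, here is a small pure-Python sketch in which the output unit depends on the operation, not only on the inputs. All names (``Unit``, ``make_unit``, ``resolve_multiply``, ``resolve_divide``, ``resolve_add``) are hypothetical stand-ins for per-ufunc dtype resolution and are not part of NumPy or ``astropy``.<br>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    # Sorted (name, exponent) pairs, e.g. (("m", 1), ("s", -1)) for m/s.
    dims: tuple = ()


def make_unit(**exponents):
    """Build a Unit, dropping dimensions whose exponent is zero."""
    return Unit(tuple(sorted((n, e) for n, e in exponents.items() if e != 0)))


def _combine(a, b, sign):
    total = dict(a.dims)
    for name, exponent in b.dims:
        total[name] = total.get(name, 0) + sign * exponent
    return make_unit(**total)


def resolve_multiply(a, b):
    # Multiplication adds the exponents of each dimension.
    return _combine(a, b, +1)


def resolve_divide(a, b):
    # Division subtracts the exponents of each dimension.
    return _combine(a, b, -1)


def resolve_add(a, b):
    # Addition has no sensible result for mismatched units.
    if a != b:
        raise TypeError("cannot add quantities with different units")
    return a
```
</pre>
For meters and seconds, the divide resolver yields m/s and the multiply resolver yields m*s, while the add resolver raises ``TypeError``. No single "common" unit could serve all three operations.<br>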

> <br>

> <br>

> <br>

> Implementation<br>

> --------------<br>

> <br>

> Plan to Approach the Full Refactor<br>

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

> <br>

> To address these issues in NumPy and enable new datatypes,<br>

> multiple development stages are required:<br>

> <br>

> * Phase I: Restructure and extend the datatype infrastructure (This<br>

> NEP)<br>

> <br>

>   * Organize Datatypes like normal Python classes [`PR 15508`]_<br>

> <br>

> * Phase II: Incrementally define or rework API<br>

> <br>

>   * Create a new and easily extensible API for defining new datatypes<br>

>     and related functionality. (NEP 42)<br>

> <br>

>   * Incrementally define all necessary functionality through the new<br>

> API (NEP 42):<br>

> <br>

>     * Defining operations such as ``np.common_type``.<br>

>     * Allowing to define casting between datatypes.<br>

>     * Add functionality necessary to create a numpy array from Python<br>

> scalars<br>

>       (i.e. ``np.array(...)``).<br>

>     * …<br>

> <br>

>   * Restructure how universal functions work (NEP 43), in order to:<br>

> <br>

>     * make it possible to allow a `~numpy.ufunc` such as ``np.add``<br>

> to be<br>

>       extended by user-defined datatypes such as Units.<br>

> <br>

>     * allow efficient lookup for the correct implementation for user-<br>

> defined<br>

>       datatypes.<br>

> <br>

>     * enable reuse of existing code. Units should be able to use the<br>

>       normal math loops and add additional logic to determine output<br>

> type.<br>

> <br>

> * Phase III: Growth of NumPy and Scientific Python Ecosystem<br>

> capabilities:<br>

> <br>

>   * Cleanup of legacy behaviour where it is considered buggy or<br>

> undesirable.<br>

>   * Provide a path to define new datatypes from Python.<br>

>   * Assist the community in creating types such as Units or<br>

> Categoricals<br>

>   * Allow strings to be used in functions such as ``np.equal`` or<br>

> ``np.add``.<br>

>   * Remove legacy code paths within NumPy to improve long term<br>

> maintainability<br>

> <br>

> This document serves as a basis for phase I and provides the vision<br>

> and<br>

> motivation for the full project.<br>

> Phase I does not introduce any new user-facing features,<br>

> but is concerned with the necessary conceptual cleanup of the current<br>

> datatype system.<br>

> It provides a more "pythonic" datatype Python type object, with a<br>

> clear class hierarchy.<br>

> <br>

> The second phase is the incremental creation of all APIs necessary to<br>

> define<br>

> fully featured datatypes and reorganization of the NumPy datatype<br>

> system.<br>

> This phase will thus be primarily concerned with defining an,<br>

> initially preliminary, stable public API.<br>

> <br>

> Some of the benefits of a large refactor may only become evident<br>

> after the full<br>

> deprecation of the current legacy implementation (i.e. larger code<br>

> removals).<br>

> However, these steps are necessary for improvements to many parts of<br>

> the<br>

> core NumPy API, and are expected to make the implementation generally<br>

> easier to understand.<br>

> <br>

> The following figure illustrates the proposed design at a high level,<br>

> and roughly delineates the components of the overall design.<br>

> Note that this NEP only concerns Phase I (shaded area),<br>

> the rest encompasses Phase II and the design choices are up for<br>

> discussion,<br>

> however, it highlights that the DType datatype class is the central,<br>

> necessary<br>

> concept:<br>

> <br>

> .. image:: _static/nep-0041-mindmap.svg<br>

> <br>

> <br>

> First steps directly related to this NEP<br>

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

> <br>

> The changes required to NumPy are large and touch many<br>

> areas<br>

> of the code base,<br>

> but many of these changes can be addressed incrementally.<br>

> <br>

> To enable an incremental approach we will start by creating a C<br>

> defined<br>

> ``PyArray_DTypeMeta`` class with its instances being the ``DType``<br>

> classes,<br>

> subclasses of ``np.dtype``.<br>

> This is necessary to add the ability of storing custom slots on the<br>

> DType in C.<br>

> This ``DTypeMeta`` will be implemented first to then enable<br>

> incremental<br>

> restructuring of current code.<br>
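<br>
The class layout described above can be modelled in pure Python as a rough sketch. The real ``PyArray_DTypeMeta`` is defined in C. Here ``DTypeMeta``, ``dtype``, ``float64`` and ``bytes_`` are simplified stand-ins, and the ``nonzero`` method is only an assumed example of a slot moving from the instance to the class.<br>
<pre>
```python
class DTypeMeta(type):
    """Stand-in for the C-defined metaclass; its instances are DType classes."""

    def __repr__(cls):
        return f"dtype[{cls.__name__}]"


class dtype(metaclass=DTypeMeta):
    """Base class; storage details such as itemsize stay on the instance."""

    def __init__(self, itemsize):
        self.itemsize = itemsize

    def nonzero(self, raw):
        # Class-level "slot": each DType subclass provides its own version.
        raise NotImplementedError


class float64(dtype):
    def __init__(self):
        super().__init__(itemsize=8)

    def nonzero(self, raw):
        return raw != 0.0


class bytes_(dtype):
    # Parametric: instances differ in itemsize but share one class.
    def nonzero(self, raw):
        return len(raw.rstrip(b"\x00")) != 0
```
</pre>
In this model ``isinstance(float64(), dtype)`` stays true while ``type(float64())`` is the ``float64`` subclass, and ``bytes_(3)`` and ``bytes_(8)`` share one class but differ in ``itemsize``, mirroring the "S3" vs. "S8" case.<br>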

> <br>

> The addition of ``DType`` will then enable addressing other changes<br>

> incrementally, some of which may begin before settling the full<br>

> internal<br>

> API:<br>

> <br>

> 1. New machinery for array coercion, with the goal of enabling user<br>

> DTypes<br>

>    with appropriate class methods.<br>

> 2. The replacement or wrapping of the current casting machinery.<br>

> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots<br>

> into<br>

>    DType method slots.<br>

> <br>

> At this point, no or only very limited new public API will be added<br>

> and<br>

> the internal API is considered to be in flux.<br>

> Any new public API may be set up to give warnings and will have leading<br>

> underscores<br>

> to indicate that it is not finalized and can be changed without<br>

> warning.<br>

> <br>

> <br>

> Backward compatibility<br>

> ----------------------<br>

> <br>

> While the actual backward compatibility impact of implementing Phase<br>

> I and II<br>

> is not yet fully clear, we anticipate and accept the following<br>

> changes:<br>

> <br>

> * **Python API**:<br>

> <br>

>   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,<br>

> while right<br>

>     now ``type(np.dtype("f8")) is np.dtype``.<br>

>     Code should use ``isinstance`` checks, and in very rare cases may<br>

> have to<br>

>     be adapted to use it.<br>

> <br>

> * **C-API**:<br>

> <br>

>     * In old versions of NumPy ``PyArray_DescrCheck`` is a macro<br>

> which uses<br>

>       ``type(dtype) is np.dtype``. When compiling against an old<br>

> NumPy version,<br>

>       the macro may have to be replaced with the corresponding<br>

>       ``PyObject_IsInstance`` call. (If this is a problem, we could<br>

> backport<br>

>       fixing the macro)<br>

> <br>

>    * The UFunc machinery changes will break *limited* parts of the<br>

> current<br>

>      implementation. Replacing e.g. the default ``TypeResolver`` is<br>

> expected<br>

>      to remain supported for a time, although optimized masked inner<br>

> loop iteration<br>

>      (which is not even used *within* NumPy) will no longer be<br>

> supported.<br>

> <br>

>    * All functions currently defined on the dtypes, such as<br>

>      ``PyArray_Descr->f->nonzero``, will be defined and accessed<br>

> differently.<br>

>      This means that in the long run low-level access code will<br>

>      have to be changed to use the new API. Such changes are expected<br>

> to be<br>

>      necessary in very few projects.<br>

> <br>

> * **dtype implementors (C-API)**:<br>

> <br>

>   * The array which is currently provided to some functions (such as<br>

> cast functions),<br>

>     will no longer be provided.<br>

>     For example ``PyArray_Descr->f->nonzero`` or<br>

>     ``PyArray_Descr->f->copyswapn``<br>

>     may instead receive a dummy array object with only some fields<br>

> (mainly the<br>

>     dtype), being valid.<br>

>     At least in some code paths, a similar mechanism is already used.<br>

> <br>

>   * The ``scalarkind`` slot and registration of scalar casting will<br>

> be<br>

>      removed/ignored without replacement.<br>

>      It currently allows partial value-based casting.<br>

>      The ``PyArray_ScalarKind`` function will continue to work for<br>

> builtin types,<br>

>      but will not be used internally and be deprecated.<br>

> <br>

>    * Currently user dtypes are defined as instances of ``np.dtype``.<br>

>      The creation works by the user providing a prototype instance.<br>

>      NumPy will need to modify at least the type during registration.<br>

>      This has no effect for either ``rational`` or ``quaternion`` and<br>

> mutation<br>

>      of the structure seems unlikely after registration.<br>

> <br>

> Since there is a fairly large API surface concerning datatypes,<br>

> further changes<br>

> or the limitation of certain functions to currently existing datatypes is<br>

> likely to occur.<br>

> For example functions which use the type number as input<br>

> should be replaced with functions taking DType classes instead.<br>

> Although public, large parts of this C-API seem to be used rarely,<br>

> possibly never, by downstream projects.<br>

> <br>

> <br>

> <br>

> Detailed Description<br>

> --------------------<br>

> <br>

> This section details the design decisions covered by this NEP.<br>

> The subsections correspond to the list of design choices presented<br>

> in the Scope section.<br>

> <br>

> Datatypes as Python Classes (1)<br>

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

> <br>

> The current NumPy datatypes are not full-scale Python classes.<br>

> They are instead (prototype) instances of a single ``np.dtype``<br>

> class.<br>

> Changing this means that any special handling, e.g. for ``datetime``<br>

> can be moved to the Datetime DType class instead, away from<br>

> monolithic general<br>

> code (e.g. current ``PyArray_AdjustFlexibleDType``).<br>

> <br>

> The main consequence of this change with respect to the API is that<br>

> special methods move from the dtype instances to methods on the new<br>

> DType class.<br>

> This is the typical design pattern used in Python.<br>

> Organizing these methods and information in a more Pythonic way<br>

> provides a<br>

> solid foundation for refining and extending the API in the future.<br>

> The current API cannot be extended due to how it is exposed<br>

> publicly.<br>

> This means for example that the methods currently stored in<br>

> ``PyArray_ArrFuncs``<br>

> on each datatype (see NEP 40) will be defined differently in the<br>

> future and<br>

> deprecated in the long run.<br>

> <br>

> The most prominent visible side effect of this will be that<br>

> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.<br>

> Instead it will be a subclass of ``np.dtype`` meaning that<br>

> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.<br>

> This will also add the ability to use ``isinstance(dtype,<br>

> np.dtype[float64])``<br>

> thus removing the need to use ``dtype.kind``, ``dtype.char``, or<br>

> ``dtype.type``<br>

> to do this check.<br>
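> <br>
> The difference between the two kinds of check can be illustrated with<br>
> stand-in classes. This is a sketch only: ``dtype`` and ``Float64DType``<br>
> are hypothetical placeholders defined here, not NumPy objects, and the<br>
> ``np.dtype[float64]`` spelling from the text is not modelled.<br>
> <br>

```python
class dtype:
    """Stand-in for ``np.dtype`` (not the real class)."""


class Float64DType(dtype):
    """Hypothetical DType class that a float64 descriptor would belong to."""


def is_descriptor_exact(obj):
    # Fragile: an exact type comparison stops matching once each
    # datatype is an instance of its own subclass.
    return type(obj) is dtype


def is_descriptor(obj):
    # Robust: isinstance keeps working for instances of any subclass.
    return isinstance(obj, dtype)


f8 = Float64DType()

print(is_descriptor_exact(f8))
print(is_descriptor(f8))
# A class-based check replaces comparisons of kind/char/type codes.
print(isinstance(f8, Float64DType))
```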

> <br>

> With the design decision of DTypes as full-scale Python classes,<br>

> the question of subclassing arises.<br>

> Inheritance, however, appears problematic and a complexity best<br>

> avoided<br>

> (at least initially) for container datatypes.<br>

> Further, subclasses may be more interesting for interoperability for<br>

> example with GPU backends (CuPy) storing additional methods related<br>

> to the<br>

> GPU rather than as a mechanism to define new datatypes.<br>

> A class hierarchy does provide value, and this may be achieved by<br>

> allowing the creation of *abstract* datatypes.<br>

> An example for an abstract datatype would be the datatype equivalent<br>

> of<br>

> ``np.floating``, representing any floating point number.<br>

> These can serve the same purpose as Python's abstract base classes.<br>
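> <br>
> A rough sketch of what an abstract datatype could look like, using<br>
> Python's own ``abc`` module. All class names (``DType``, ``Floating``,<br>
> ``Float64``, ``Int64``) and the ``itemsize`` attribute are assumptions<br>
> made for this illustration and do not describe the final NumPy API.<br>
> <br>

```python
from abc import ABC, abstractmethod


class DType(ABC):
    """Hypothetical root of the datatype hierarchy."""

    @property
    @abstractmethod
    def itemsize(self):
        """Size of one element in bytes."""


class Floating(DType):
    """Abstract datatype: stands for any floating point datatype.

    It does not define ``itemsize``, so it cannot be instantiated.
    """


class Float64(Floating):
    itemsize = 8  # concrete: overrides the abstract property


class Int64(DType):
    itemsize = 8  # concrete, but not a floating point datatype


def is_floating(descr):
    # One isinstance check against the abstract class covers every
    # concrete floating point datatype.
    return isinstance(descr, Floating)


print(is_floating(Float64()))
print(is_floating(Int64()))
```

> Attempting ``Floating()`` in this sketch raises ``TypeError``, which is<br>
> the behaviour one would expect from an abstract datatype.<br>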

> <br>

> <br>

> Scalars should not be instances of the datatypes (2)<br>

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

> <br>

> For simple datatypes such as ``float64`` (see also below), it seems<br>

> tempting that the instance of a ``np.dtype("float64")`` can be the<br>

> scalar.<br>

> This idea may be even more appealing due to the fact that scalars,<br>

> rather than datatypes, currently define a useful type hierarchy.<br>

> <br>

> However, we have specifically decided against this for a number of<br>

> reasons.<br>

> First, the new datatypes described herein would be instances of DType<br>

> classes.<br>

> Making these instances themselves classes, while possible, adds<br>

> additional<br>

> complexity that users need to understand.<br>

> It would also mean that scalars must have storage information (such<br>

> as byteorder)<br>

> which is generally unnecessary and currently is not used.<br>

> Second, while the simple NumPy scalars such as ``float64`` may be<br>

> such instances,<br>

> it should be possible to create datatypes for Python objects without<br>

> enforcing<br>

> NumPy as a dependency.<br>

> However, Python objects that do not depend on NumPy cannot be<br>

> instances of a NumPy DType.<br>

> Third, there is a mismatch between the methods and attributes which<br>

> are useful<br>

> for scalars and datatypes. For instance ``to_float()`` makes sense<br>

> for a scalar<br>

> but not for a datatype and ``newbyteorder`` is not useful on a scalar<br>

> (or has<br>

> a different meaning).<br>

> <br>

> Overall, it seems that rather than reducing the complexity, i.e. by merging<br>

> the two distinct type hierarchies, making scalars instances of DTypes<br>

> would<br>

> increase the complexity of both the design and implementation.<br>

> <br>

> A possible future path may be to instead simplify the current NumPy<br>

> scalars to<br>

> be much simpler objects which largely derive their behaviour from the<br>

> datatypes.<br>

> <br>

> C-API for creating new Datatypes (3)<br>

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

> <br>

> The current C-API with which users can create new datatypes<br>

> is limited in scope, and requires use of "private" structures. This<br>

> means<br>

> the API is not extensible: no new members can be added to the<br>

> structure<br>

> without losing binary compatibility.<br>

> This has already limited the inclusion of new sorting methods into<br>

> NumPy [new_sort]_.<br>

> <br>

> The new version shall thus replace the current ``PyArray_ArrFuncs``<br>

> structure used<br>

> to define new datatypes.<br>

> Datatypes that currently exist and are defined using these slots will<br>

> be<br>

> supported during a deprecation period.<br>

> <br>

> The most likely solution to hide the implementation from the user<br>

> and thus make<br>

> it extensible in the future is to model the API after Python's stable<br>

> API [PEP-384]_:<br>

> <br>

> .. code-block:: C<br>

> <br>

>     static struct PyArrayMethodDef slots[] = {<br>

>         {NPY_dt_method, method_implementation},<br>

>         ...,<br>

>         {0, NULL}<br>

>     };<br>

> <br>

>     typedef struct{<br>

>       PyTypeObject *typeobj;  /* type of python scalar */<br>

>       ...;<br>

>       PyType_Slot *slots;<br>

>     } PyArrayDTypeMeta_Spec;<br>

> <br>

>     PyObject* PyArray_InitDTypeMetaFromSpec(<br>

>             PyArray_DTypeMeta *user_dtype,<br>

>             PyArrayDTypeMeta_Spec *dtype_spec);<br>

> <br>

> The C-side slots should be designed to mirror Python side methods<br>

> such as ``dtype.__dtype_method__``, although the exposure to Python<br>

> is<br>

> a later step in the implementation to reduce the complexity of the<br>

> initial<br>

> implementation.<br>
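> <br>
> The idea behind a spec-based API can be mimicked in Python as an analogy.<br>
> The sketch below is not the proposed C-API: ``DT_GETITEM``,<br>
> ``DT_SETITEM`` and ``dtype_from_spec`` are hypothetical names chosen for<br>
> this example. It only shows why a list of (slot id, implementation)<br>
> pairs stays extensible: slots the user does not mention receive a<br>
> default, so new slot ids can be added later without breaking old specs.<br>
> <br>

```python
# Hypothetical integer slot identifiers (the text above only names
# ``NPY_dt_method`` as a placeholder).
DT_GETITEM = 1
DT_SETITEM = 2

# Defaults owned by the library; user code never sees this table's layout.
_DEFAULT_SLOTS = {
    DT_GETITEM: lambda raw: raw,
    DT_SETITEM: lambda value: value,
}


def dtype_from_spec(name, scalar_type, slots):
    """Build a class from ``(slot_id, implementation)`` pairs."""
    table = dict(_DEFAULT_SLOTS)
    for slot_id, impl in slots:
        if slot_id not in _DEFAULT_SLOTS:
            raise ValueError(f"unknown slot id: {slot_id!r}")
        table[slot_id] = impl
    return type(name, (), {"scalar_type": scalar_type, "_slots": table})


# A user only lists the slots they want to override.
Celsius = dtype_from_spec("Celsius", float, [(DT_SETITEM, float)])

print(Celsius.__name__)
print(Celsius._slots[DT_SETITEM]("21.5"))
print(Celsius._slots[DT_GETITEM](3))
```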

> <br>

> <br>

> C-API Changes to the UFunc Machinery (4)<br>

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

> <br>

> Proposed changes to the UFunc machinery will be part of NEP 43.<br>

> However, the following changes will be necessary (see NEP 40 for a<br>

> detailed<br>

> description of the current implementation and its issues):<br>

> <br>

> * The current UFunc type resolution must be adapted to allow better<br>

> control<br>

>   for user-defined dtypes as well as resolve current inconsistencies.<br>

> * The inner-loop used in UFuncs must be expanded to include a return<br>

> value.<br>

>   Further, error reporting must be improved, and passing in<br>

>   dtype-specific information enabled.<br>

>   This requires the modification of the inner-loop function signature<br>

> and<br>

>   addition of new hooks called before and after the inner-loop is<br>

> used.<br>

> <br>

> An important goal for any changes to the universal functions will be<br>

> to<br>

> allow the reuse of existing loops.<br>

> It should be easy for a new units datatype to fall back to existing<br>

> math<br>

> functions after handling the unit related computations.<br>
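> <br>
> The loop-reuse goal can be pictured with a small pure-Python sketch.<br>
> Everything here is an assumption for illustration: ``float_multiply_loop``<br>
> stands in for an existing inner loop, and units are modelled as a plain<br>
> dictionary of base-unit exponents rather than as a real datatype.<br>
> <br>

```python
import operator


def float_multiply_loop(a, b):
    """Stand-in for an existing float inner loop (elementwise multiply)."""
    return [operator.mul(x, y) for x, y in zip(a, b)]


def unit_multiply(a_values, a_unit, b_values, b_unit):
    """Multiply two unit-carrying sequences.

    Units are dicts mapping a base unit name to its exponent.
    """
    # Unit-specific step: add exponents, dropping units that cancel.
    out_unit = dict(a_unit)
    for base, exponent in b_unit.items():
        out_unit[base] = out_unit.get(base, 0) + exponent
    out_unit = {base: e for base, e in out_unit.items() if e != 0}

    # Numeric step: fall back to the existing loop unchanged.
    return float_multiply_loop(a_values, b_values), out_unit


values, unit = unit_multiply(
    [2.0, 3.0], {"m": 1},
    [4.0, 0.5], {"m": 1, "s": -1},
)
print(values, unit)
```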

> <br>

> <br>

> Discussion<br>

> ----------<br>

> <br>

> See NEP 40 for a list of previous meetings and discussions.<br>

> <br>

> <br>

> References<br>

> ----------<br>

> <br>

> .. [pandas_extension_arrays] <br>

> <a href="https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types" rel="noreferrer" target="_blank">https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types</a><br>

> <br>

> .. _xarray_dtype_issue: <a href="https://github.com/pydata/xarray/issues/1262" rel="noreferrer" target="_blank">https://github.com/pydata/xarray/issues/1262</a><br>

> <br>

> .. [pygeos] <a href="https://github.com/caspervdw/pygeos" rel="noreferrer" target="_blank">https://github.com/caspervdw/pygeos</a><br>

> <br>

> .. [new_sort] <a href="https://github.com/numpy/numpy/pull/12945" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/12945</a><br>

> <br>

> .. [PEP-384] <a href="https://www.python.org/dev/peps/pep-0384/" rel="noreferrer" target="_blank">https://www.python.org/dev/peps/pep-0384/</a><br>

> <br>

> .. [PR 15508] <a href="https://github.com/numpy/numpy/pull/15508" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/15508</a><br>

> <br>

> <br>

> Copyright<br>

> ---------<br>

> <br>

> This document has been placed in the public domain.<br>

> <br>

> <br>

> Acknowledgments<br>

> ---------------<br>

> <br>

> The effort to create new datatypes for NumPy has been discussed for<br>

> several<br>

> years in many different contexts and settings, making it impossible<br>

> to list everyone involved.<br>

> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and<br>

> Eric Wieser<br>

> for repeated in-depth discussion about datatype design.<br>

> We are very grateful for the community input in reviewing and<br>

> revising this<br>

> NEP and would like to thank especially Ross Barnowski and Ralf<br>

> Gommers.<br>

> <br>

> _______________________________________________<br>

> NumPy-Discussion mailing list<br>

> <a href="mailto:NumPy-Discussion@python.org" target="_blank">NumPy-Discussion@python.org</a><br>

> <a href="https://mail.python.org/mailman/listinfo/numpy-discussion" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/numpy-discussion</a><br>

<br>


</blockquote></div></div>