<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg <<a href="mailto:sebastian@sipsolutions.net">sebastian@sipsolutions.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi all,<br>
<br>
in the spirit of trying to keep this moving, can I assume that the main<br>
reason for little discussion is that the actual changes proposed are<br>
not very far reaching as of now? Or is the reason that this is a<br>
fairly complex topic that you need more time to think about it?<br></blockquote><div><br></div><div>Probably (a) it's a long NEP on a complex topic, (b) the past week has been a very weird week for everyone (in the extra-news-reading-time I could easily have re-reviewed the NEP), and (c) the amount of feedback one expects to get on a NEP is roughly inversely proportional to the scope and complexity of the NEP contents.</div><div><br></div><div>Today I re-read the parts I commented on before. This version is a big improvement over the previous ones. Thanks in particular for adding clear examples and the diagram, it helps a lot.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
If it is the latter, is there some way I can help with it? I tried to<br>
minimize how much is part of this initial NEP.<br>
<br>
If there is not much need for discussion, I would like to officially<br>
accept the NEP very soon, sending out an official one week notice in<br>
the next days.<br></blockquote><div><br></div><div>I agree. I think I would like to keep the option open though to come back to the NEP later to improve the clarity of the text about motivation/plan/examples/scope, given that this will be the reference for a major amount of work for a long time to come.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
To summarize one more time, the main point is that:<br></blockquote><div><br></div><div>This point seems fine, and I'm +1 for going ahead with the described parts of the technical design.</div><div><br></div><div>Cheers,</div><div>Ralf</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
type(np.dtype(np.float64))<br>
<br>
will be `np.dtype[float64]`, a subclass of dtype, so that:<br>
<br>
issubclass(np.dtype[float64], np.dtype)<br>
<br>
is true. This means that we will have one class for every current type<br>
number: `dtype.num`. The implementation of these subclasses will be a<br>
C-written (extension) MetaClass; all details of this class are supposed<br>
to remain experimental and in flux at this time.<br>
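<br>
A minimal illustrative check of these relations (assuming the proposed<br>
behaviour; ``np.dtype[float64]`` is the notation used above):<br>
<br>
>>> import numpy as np<br>
>>> f64 = np.dtype(np.float64)<br>
>>> isinstance(f64, np.dtype)        # unchanged, keeps working<br>
True<br>
>>> type(f64) is np.dtype            # no longer true after this change<br>
False<br>
>>> issubclass(type(f64), np.dtype)  # type(f64) is the float64 subclass<br>
True<br>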
<br>
Cheers<br>
<br>
Sebastian<br>
<br>
<br>
On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:<br>
> Hi all,<br>
> <br>
> I am pleased to propose NEP 41: First step towards a new Datatype<br>
> System <a href="https://numpy.org/neps/nep-0041-improved-dtype-support.html" rel="noreferrer" target="_blank">https://numpy.org/neps/nep-0041-improved-dtype-support.html</a><br>
> <br>
> This NEP motivates the larger restructure of the datatype machinery<br>
> in<br>
> NumPy and defines a few fundamental design aspects. The long term<br>
> user<br>
> impact will be allowing easier and more rich featured user defined<br>
> datatypes.<br>
> <br>
> As this is a large restructure, the NEP represents only the first<br>
> steps<br>
> with some additional information in further NEPs being drafted [1]<br>
> (this may be helpful to look at depending on the level of detail you<br>
> are interested in).<br>
> The NEP itself does not propose to add significant new public API.<br>
> Instead it proposes to move forward with an incremental internal<br>
> refactor and lays the foundation for this process.<br>
> <br>
> The main user facing change at this time is that datatypes will<br>
> become<br>
> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific<br>
> class).<br>
> For most users, the main impact should be many new datatypes in the<br>
> long run (see the user impact section). However, for those interested<br>
> in API design within NumPy or with respect to implementing new<br>
> datatypes, this and the following NEPs are important decisions in the<br>
> future roadmap for NumPy.<br>
> <br>
> The current full text is reproduced below, although the above link is<br>
> probably a better way to read it.<br>
> <br>
> Cheers<br>
> <br>
> Sebastian<br>
> <br>
> <br>
> [1] NEP 40 gives some background information about the current<br>
> systems<br>
> and issues with it:<br>
> <a href="https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst</a><br>
> and NEP 42 being a first draft of how the new API may look like:<br>
> <br>
> <a href="https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst</a><br>
> (links to current rendered versions, check <br>
> <a href="https://github.com/numpy/numpy/pull/15505" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/15505</a> and <br>
> <a href="https://github.com/numpy/numpy/pull/15507" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/15507</a> for updates)<br>
> <br>
> <br>
> -------------------------------------------------------------------<br>
> ---<br>
> <br>
> <br>
> =================================================<br>
> NEP 41 — First step towards a new Datatype System<br>
> =================================================<br>
> <br>
> :title: Improved Datatype Support<br>
> :Author: Sebastian Berg<br>
> :Author: Stéfan van der Walt<br>
> :Author: Matti Picus<br>
> :Status: Draft<br>
> :Type: Standard Track<br>
> :Created: 2020-02-03<br>
> <br>
> <br>
> .. note::<br>
> <br>
> This NEP is part of a series of NEPs encompassing first<br>
> information<br>
> about the previous dtype implementation and issues with it in NEP<br>
> 40.<br>
> NEP 41 (this document) then provides an overview and generic<br>
> design<br>
> choices for the refactor.<br>
> Further NEPs 42 and 43 go into the technical details of the<br>
> datatype<br>
> and universal function related internal and external API changes.<br>
> In some cases it may be necessary to consult the other NEPs for a<br>
> full<br>
> picture of the desired changes and why these changes are<br>
> necessary.<br>
> <br>
> <br>
> Abstract<br>
> --------<br>
> <br>
> `Datatypes <data-type-objects-dtype>` in NumPy describe how to<br>
> interpret each<br>
> element in arrays. NumPy provides ``int``, ``float``, and ``complex``<br>
> numerical<br>
> types, as well as string, datetime, and structured datatype<br>
> capabilities.<br>
> The growing Python community, however, has need for more diverse<br>
> datatypes.<br>
> Examples are datatypes with unit information attached (such as<br>
> meters) or<br>
> categorical datatypes (fixed set of possible values).<br>
> However, the current NumPy datatype API is too limited to allow the<br>
> creation<br>
> of these.<br>
> <br>
> This NEP is the first step to enable such growth; it will lead to<br>
> a simpler development path for new datatypes.<br>
> In the long run the new datatype system will also support the<br>
> creation<br>
> of datatypes directly from Python rather than C.<br>
> Refactoring the datatype API will improve maintainability and<br>
> facilitate<br>
> development of both user-defined external datatypes<br>
> and new features for existing datatypes internal to NumPy.<br>
> <br>
> <br>
> Motivation and Scope<br>
> --------------------<br>
> <br>
> .. seealso::<br>
> <br>
> The user impact section includes examples of what kind of new<br>
> datatypes<br>
> will be enabled by the proposed changes in the long run.<br>
> It may thus help to read that section out of order.<br>
> <br>
> Motivation<br>
> ^^^^^^^^^^<br>
> <br>
> One of the main issues with the current API is the definition of<br>
> typical<br>
> functions such as addition and multiplication for parametric<br>
> datatypes<br>
> (see also NEP 40) which require additional steps to determine the<br>
> output type.<br>
> For example when adding two strings of length 4, the result is a<br>
> string<br>
> of length 8, which is different from the input.<br>
> Similarly, a datatype which embeds a physical unit must calculate the<br>
> new unit<br>
> information: dividing a distance by a time results in a speed.<br>
> A related difficulty is that the :ref:`current casting rules<br>
> <ufuncs.casting>`<br>
> -- the conversion between different datatypes --<br>
> cannot describe casting for such parametric datatypes implemented<br>
> outside of NumPy.<br>
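> <br>
> The string case can already be seen with today's builtin bytes-string<br>
> dtype (shown purely as an illustration of parametric output types)::<br>
> <br>
> >>> np.char.add(np.array([b"abcd"]), np.array([b"efgh"]))<br>
> array([b'abcdefgh'], dtype='|S8')<br>
> <br>
> Determining that ``S8`` output dtype is exactly the kind of step which<br>
> user-defined parametric datatypes currently cannot hook into.<br>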
> <br>
> This additional functionality for supporting parametric datatypes<br>
> introduces<br>
> increased complexity within NumPy itself,<br>
> and furthermore is not available to external user-defined datatypes.<br>
> In general the concerns of different datatypes are not well<br>
> encapsulated.<br>
> This burden is exacerbated by the exposure of internal C structures,<br>
> limiting the addition of new fields<br>
> (for example to support new sorting methods [new_sort]_).<br>
> <br>
> Currently there are many factors which limit the creation of new<br>
> user-defined<br>
> datatypes:<br>
> <br>
> * Creating casting rules for parametric user-defined dtypes is either<br>
> impossible<br>
> or so complex that it has never been attempted.<br>
> * Type promotion, e.g. the operation deciding that adding float and<br>
> integer<br>
> values should return a float value, is very valuable for numeric<br>
> datatypes<br>
> but is limited in scope for user-defined and especially parametric<br>
> datatypes.<br>
> * Much of the logic (e.g. promotion) is written in single functions<br>
> instead of being split as methods on the datatype itself.<br>
> * In the current design datatypes cannot have methods that do not<br>
> generalize<br>
> to other datatypes. For example a unit datatype cannot have a<br>
> ``.to_si()`` method to<br>
> easily find the datatype which would represent the same values in<br>
> SI units.<br>
> <br>
> The need to solve these issues has driven the scientific<br>
> community<br>
> to create workarounds in multiple projects, implementing physical<br>
> units as an<br>
> array-like class instead of a datatype, which would generalize better<br>
> across<br>
> multiple array-likes (Dask, pandas, etc.).<br>
> Already, Pandas has made a push in the same direction with its<br>
> extension arrays [pandas_extension_arrays]_ and undoubtedly<br>
> the community would be best served if such new features could be<br>
> common<br>
> between NumPy, Pandas, and other projects.<br>
> <br>
> Scope<br>
> ^^^^^<br>
> <br>
> The proposed refactoring of the datatype system is a large<br>
> undertaking and<br>
> thus is proposed to be split into various phases, roughly:<br>
> <br>
> * Phase I: Restructure and extend the datatype infrastructure (This<br>
> NEP 41)<br>
> * Phase II: Incrementally define or rework API (Detailed largely in<br>
> NEPs 42/43)<br>
> * Phase III: Growth of NumPy and Scientific Python Ecosystem<br>
> capabilities.<br>
> <br>
> For a more detailed accounting of the various phases, see<br>
> "Plan to Approach the Full Refactor" in the Implementation section<br>
> below.<br>
> This NEP proposes to move ahead with the necessary creation of new<br>
> dtype<br>
> subclasses (Phase I),<br>
> and start working on implementing current functionality.<br>
> Within the context of this NEP all development will be fully private<br>
> API or<br>
> use preliminary underscored names which must be changed in the<br>
> future.<br>
> Most of the internal and public API choices are part of a second<br>
> Phase<br>
> and will be discussed in more detail in the following NEPs 42 and 43.<br>
> The initial implementation of this NEP will have little or no effect<br>
> on users,<br>
> but provides the necessary ground work for incrementally addressing<br>
> the<br>
> full rework.<br>
> <br>
> The implementation of this NEP and the following, implied large<br>
> rework of how<br>
> datatypes are defined in NumPy is expected to create small<br>
> incompatibilities<br>
> (see backward compatibility section).<br>
> However, a transition requiring large code adaption is not<br>
> anticipated and not<br>
> within scope.<br>
> <br>
> Specifically, this NEP makes the following design choices which are<br>
> discussed<br>
> in more detail in the detailed description section:<br>
> <br>
> 1. Each datatype will be an instance of a subclass of ``np.dtype``,<br>
> with most of the<br>
> datatype-specific logic being implemented<br>
> as special methods on the class. In the C-API, these correspond to<br>
> specific<br>
> slots. In short, for ``f = np.dtype("f8")``, ``isinstance(f,<br>
> np.dtype)`` will remain true,<br>
> but ``type(f)`` will be a subclass of ``np.dtype`` rather than<br>
> just ``np.dtype`` itself.<br>
> The ``PyArray_ArrFuncs`` which are currently stored as a pointer<br>
> on the instance (as ``PyArray_Descr->f``),<br>
> should instead be stored on the class as typically done in Python.<br>
> In the future these may correspond to Python-side dunder methods.<br>
> Storage information such as itemsize and byteorder can differ<br>
> between<br>
> different dtype instances (e.g. "S3" vs. "S8") and will remain<br>
> part of the instance.<br>
> This means that in the long run the current low-level access to<br>
> dtype methods<br>
> will be removed (see ``PyArray_ArrFuncs`` in NEP 40).<br>
> <br>
> 2. The current NumPy scalars will *not* change, they will not be<br>
> instances of<br>
> datatypes. This will also be true for new datatypes, scalars will<br>
> not be<br>
> instances of a dtype (although ``isinstance(scalar, dtype)`` may<br>
> be made<br>
> to return ``True`` when appropriate).<br>
> <br>
> Detailed technical decisions to follow in NEP 42.<br>
> <br>
> Further, the public API will be designed in a way that is extensible<br>
> in the future:<br>
> <br>
> 3. All new C-API functions provided to the user will hide<br>
> implementation details<br>
> as much as possible. The public API should be an identical, but<br>
> limited,<br>
> version of the C-API used for the internal NumPy datatypes.<br>
> <br>
> The changes to the datatype system in Phase II must include a large<br>
> refactor of the<br>
> UFunc machinery, which will be further defined in NEP 43:<br>
> <br>
> 4. To enable all of the desired functionality for new user-defined<br>
> datatypes,<br>
> the UFunc machinery will be changed to replace the current<br>
> dispatching<br>
> and type resolution system.<br>
> The old system should be *mostly* supported as a legacy version<br>
> for some time.<br>
> <br>
> Additionally, as a general design principle, the addition of new<br>
> user-defined<br>
> datatypes will *not* change the behaviour of programs.<br>
> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or<br>
> ``b`` know<br>
> that ``c`` exists.<br>
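> <br>
> For example, with today's builtin dtypes (which all know about each<br>
> other)::<br>
> <br>
> >>> np.result_type(np.int64, np.float32)<br>
> dtype('float64')<br>
> <br>
> is acceptable, because ``float64`` is known to both inputs; a newly<br>
> registered user datatype must never be silently introduced as the<br>
> common dtype of two datatypes that are unaware of it.<br>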
> <br>
> <br>
> User Impact<br>
> -----------<br>
> <br>
> The current ecosystem has very few user-defined datatypes using<br>
> NumPy, the<br>
> two most prominent being: ``rational`` and ``quaternion``.<br>
> These represent fairly simple datatypes which are not strongly<br>
> impacted<br>
> by the current limitations.<br>
> However, we have identified a need for datatypes such as:<br>
> <br>
> * bfloat16, used in deep learning<br>
> * categorical types<br>
> * physical units (such as meters)<br>
> * datatypes for tracing/automatic differentiation<br>
> * high, fixed precision math<br>
> * specialized integer types such as int2, int24<br>
> * new, better datetime representations<br>
> * extending e.g. integer dtypes to have a sentinel NA value<br>
> * geometrical objects [pygeos]_<br>
> <br>
> Some of these are partially solved; for example unit capability is<br>
> provided<br>
> in ``astropy.units``, ``unyt``, or ``pint``, as `numpy.ndarray`<br>
> subclasses.<br>
> Most of these datatypes, however, simply cannot be reasonably defined<br>
> right now.<br>
> An advantage of having such datatypes in NumPy is that they should<br>
> integrate<br>
> seamlessly with other array or array-like packages such as Pandas,<br>
> ``xarray`` [xarray_dtype_issue]_, or ``Dask``.<br>
> <br>
> The long term user impact of implementing this NEP will be to allow<br>
> both<br>
> the growth of the whole ecosystem by having such new datatypes, and<br>
> the consolidation of<br>
> the implementation of such datatypes within NumPy to<br>
> achieve<br>
> better interoperability.<br>
> <br>
> <br>
> Examples<br>
> ^^^^^^^^<br>
> <br>
> The following examples represent future user-defined datatypes we<br>
> wish to enable.<br>
> These datatypes are not part of the NEP and the choices (e.g. choice of<br>
> casting rules)<br>
> are possibilities we wish to enable and do not represent<br>
> recommendations.<br>
> <br>
> Simple Numerical Types<br>
> """"""""""""""""""""""<br>
> <br>
> Mainly used where memory is a consideration, lower-precision numeric<br>
> types<br>
> such as `bfloat16 <<br>
> <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Bfloat16_floating-point_format</a>>`_<br>
> are common in other computational frameworks.<br>
> For these types the definitions of things such as ``np.common_type``<br>
> and<br>
> ``np.can_cast`` are some of the most important interfaces. Once they<br>
> support ``np.common_type``, it is (for the most part) possible to<br>
> find<br>
> the correct ufunc loop to call, since most ufuncs -- such as add --<br>
> effectively<br>
> only require ``np.result_type``::<br>
> <br>
> >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)<br>
> <br>
> and `~numpy.result_type` is largely identical to<br>
> `~numpy.common_type`.<br>
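> <br>
> For the builtin numerical dtypes this identity already holds today, as<br>
> a small illustrative check shows::<br>
> <br>
> >>> arr1 = np.ones(3, dtype=np.float32)<br>
> >>> arr2 = np.ones(3, dtype=np.float64)<br>
> >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)<br>
> True<br>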
> <br>
> <br>
> Fixed, high precision math<br>
> """"""""""""""""""""""""""<br>
> <br>
> Allowing arbitrary precision or higher precision math is important in<br>
> simulations. For instance ``mpmath`` defines a precision::<br>
> <br>
> >>> import mpmath as mp<br>
> >>> print(mp.dps) # the current (default) precision<br>
> 15<br>
> <br>
> NumPy should be able to construct a native, memory-efficient array<br>
> from<br>
> a list of ``mpmath.mpf`` floating point objects::<br>
> <br>
> >>> arr_15_dps = np.array(mp.arange(3)) # (mp.arange returns a<br>
> list)<br>
> >>> print(arr_15_dps) # Must find the correct precision from the<br>
> objects:<br>
> array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])<br>
> <br>
> We should also be able to specify the desired precision when<br>
> creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``<br>
> to find the DType class (the notation is not part of this NEP),<br>
> which is then instantiated with the desired parameter.<br>
> This could also be written as the ``MpfDType`` class::<br>
> <br>
> >>> arr_100_dps = np.array([1, 2, 3],<br>
> dtype=np.dtype[mp.mpf](dps=100))<br>
> >>> print(arr_15_dps + arr_100_dps)<br>
> array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])<br>
> <br>
> The ``mpf`` datatype can decide that the result of the operation<br>
> should be the<br>
> higher precision one of the two, so uses a precision of 100.<br>
> Furthermore, we should be able to define casting, for example as in::<br>
> <br>
> >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype,<br>
> casting="safe")<br>
> True<br>
> >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype,<br>
> casting="safe")<br>
> False # loses precision<br>
> >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype,<br>
> casting="same_kind")<br>
> True<br>
> <br>
> Casting from float is probably always at least a ``same_kind``<br>
> cast, but<br>
> in general, it is not safe::<br>
> <br>
> >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4),<br>
> casting="safe")<br>
> False<br>
> <br>
> since a float64 has a higher precision than the ``mpf`` datatype with<br>
> ``dps=4``.<br>
> <br>
> Alternatively, we can say that::<br>
> <br>
> >>> np.common_type(np.dtype[mp.mpf](dps=5),<br>
> np.dtype[mp.mpf](dps=10))<br>
> np.dtype[mp.mpf](dps=10)<br>
> <br>
> And possibly even::<br>
> <br>
> >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)<br>
> np.dtype[mp.mpf](dps=16) # equivalent precision to float64 (I<br>
> believe)<br>
> <br>
> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``<br>
> safely.<br>
> <br>
> <br>
> Categoricals<br>
> """"""""""""<br>
> <br>
> Categoricals are interesting in that they can have fixed, predefined<br>
> values,<br>
> or can be dynamic with the ability to modify categories when<br>
> necessary.<br>
> Fixed categories (defined ahead of time) are the most straightforward<br>
> categorical definition.<br>
> Categoricals are *hard*, since there are many strategies to implement<br>
> them,<br>
> suggesting NumPy should only provide the scaffolding for user-defined<br>
> categorical types. For instance::<br>
> <br>
> >>> cat = Categorical(["eggs", "spam", "toast"])<br>
> >>> breakfast = array(["eggs", "spam", "eggs", "toast"],<br>
> dtype=cat)<br>
> <br>
> could store the array very efficiently, since it knows that there are<br>
> only 3<br>
> categories.<br>
> Since a categorical in this sense knows almost nothing about the data<br>
> stored<br>
> in it, few operations make sense, although equality does::<br>
> <br>
> >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"],<br>
> dtype=cat)<br>
> >>> breakfast == breakfast2<br>
> array([True, False, True, False])<br>
> <br>
> The categorical datatype could work like a dictionary: no two<br>
> item names can be equal (checked on dtype creation), so that the<br>
> equality<br>
> operation above can be performed very efficiently.<br>
> If the values define an order, the category labels (internally<br>
> integers) could<br>
> be ordered the same way to allow efficient sorting and comparison.<br>
> <br>
> Whether or not casting is defined from one categorical with fewer<br>
> values to one with<br>
> strictly more values defined is something that the Categorical<br>
> datatype would<br>
> need to decide. Both options should be available.<br>
> <br>
> <br>
> Unit on the Datatype<br>
> """"""""""""""""""""<br>
> <br>
> There are different ways to define Units, depending on how the<br>
> internal<br>
> machinery would be organized. One way is to have a single Unit<br>
> datatype<br>
> for every existing numerical type.<br>
> This will be written as ``Unit[float64]``; the unit itself is part of<br>
> the<br>
> DType instance, so ``Unit[float64]("m")`` is a ``float64`` with meters<br>
> attached::<br>
> <br>
> >>> from astropy import units<br>
> >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m #<br>
> meters<br>
> >>> print(meters)<br>
> array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))<br>
> <br>
> Note that units are a bit tricky. It is debatable whether::<br>
> <br>
> >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))<br>
> <br>
> should be valid syntax (coercing the float scalars without a unit to<br>
> meters).<br>
> Once the array is created, math will work without any issue::<br>
> <br>
> >>> meters / (2 * units.s)<br>
> array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))<br>
> <br>
> Casting is not valid from one unit to the other, but can be valid<br>
> between<br>
> different scales of the same dimensionality (although this may be<br>
> "unsafe")::<br>
> <br>
> >>> meters.astype(Unit[float64]("s"))<br>
> TypeError: Cannot cast meters to seconds.<br>
> >>> meters.astype(Unit[float64]("km"))<br>
> >>> # Convert to centimeter-gram-second (cgs) units:<br>
> >>> meters.astype(meters.dtype.to_cgs())<br>
> <br>
> The above notation is somewhat clumsy. Functions<br>
> could be used instead to convert between units.<br>
> There may be ways to make these more convenient, but those must be<br>
> left<br>
> for future discussions::<br>
> <br>
> >>> units.convert(meters, "km")<br>
> >>> units.to_cgs(meters)<br>
> <br>
> There are some open questions, for example whether additional<br>
> methods<br>
> on the array object could exist to simplify some of the notions, and<br>
> how these<br>
> would percolate from the datatype to the ``ndarray``.<br>
> <br>
> The interaction with other scalars would likely be defined through::<br>
> <br>
> >>> np.common_type(np.float64, Unit)<br>
> Unit[np.float64](dimensionless)<br>
> <br>
> Ufunc output datatype determination can be more involved than for<br>
> simple<br>
> numerical dtypes since there is no "universal" output type::<br>
> <br>
> >>> np.multiply(meters, seconds).dtype != np.result_type(meters,<br>
> seconds)<br>
> <br>
> In fact ``np.result_type(meters, seconds)`` must error without<br>
> context<br>
> of the operation being done.<br>
> This example highlights how the specific ufunc loop<br>
> (the loop with known, specific DTypes as inputs) has to be able to<br>
> make<br>
> certain decisions before the actual calculation can start.<br>
> <br>
> <br>
> <br>
> Implementation<br>
> --------------<br>
> <br>
> Plan to Approach the Full Refactor<br>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>
> <br>
> To address these issues in NumPy and enable new datatypes,<br>
> multiple development stages are required:<br>
> <br>
> * Phase I: Restructure and extend the datatype infrastructure (This<br>
> NEP)<br>
> <br>
> * Organize Datatypes like normal Python classes [`PR 15508`]_<br>
> <br>
> * Phase II: Incrementally define or rework API<br>
> <br>
> * Create a new and easily extensible API for defining new datatypes<br>
> and related functionality. (NEP 42)<br>
> <br>
> * Incrementally define all necessary functionality through the new<br>
> API (NEP 42):<br>
> <br>
> * Defining operations such as ``np.common_type``.<br>
> * Allowing the definition of casting between datatypes.<br>
> * Add functionality necessary to create a numpy array from Python<br>
> scalars<br>
> (i.e. ``np.array(...)``).<br>
> * …<br>
> <br>
> * Restructure how universal functions work (NEP 43), in order to:<br>
> <br>
> * make it possible to allow a `~numpy.ufunc` such as ``np.add``<br>
> to be<br>
> extended by user-defined datatypes such as Units.<br>
> <br>
> * allow efficient lookup for the correct implementation for user-<br>
> defined<br>
> datatypes.<br>
> <br>
> * enable reuse of existing code. Units should be able to use the<br>
> normal math loops and add additional logic to determine output<br>
> type.<br>
> <br>
> * Phase III: Growth of NumPy and Scientific Python Ecosystem<br>
> capabilities:<br>
> <br>
> * Cleanup of legacy behaviour where it is considered buggy or<br>
> undesirable.<br>
> * Provide a path to define new datatypes from Python.<br>
> * Assist the community in creating types such as Units or<br>
> Categoricals<br>
> * Allow strings to be used in functions such as ``np.equal`` or<br>
> ``np.add``.<br>
> * Remove legacy code paths within NumPy to improve long term<br>
> maintainability<br>
> <br>
> This document serves as a basis for phase I and provides the vision<br>
> and<br>
> motivation for the full project.<br>
> Phase I does not introduce any new user-facing features,<br>
> but is concerned with the necessary conceptual cleanup of the current<br>
> datatype system.<br>
> It provides a more "pythonic" datatype Python type object, with a<br>
> clear class hierarchy.<br>
> <br>
> The second phase is the incremental creation of all APIs necessary to<br>
> define<br>
> fully featured datatypes and reorganization of the NumPy datatype<br>
> system.<br>
> This phase will thus be primarily concerned with defining an<br>
> initially preliminary, but eventually stable, public API.<br>
> <br>
> Some of the benefits of a large refactor may only become evident<br>
> after the full<br>
> deprecation of the current legacy implementation (i.e. larger code<br>
> removals).<br>
> However, these steps are necessary for improvements to many parts of<br>
> the<br>
> core NumPy API, and are expected to make the implementation generally<br>
> easier to understand.<br>
> <br>
> The following figure illustrates the proposed design at a high level,<br>
> and roughly delineates the components of the overall design.<br>
> Note that this NEP only covers Phase I (shaded area);<br>
> the rest encompasses Phase II, whose design choices are still up for<br>
> discussion.<br>
> The figure nevertheless highlights that the DType class is the central,<br>
> necessary<br>
> concept:<br>
> <br>
> .. image:: _static/nep-0041-mindmap.svg<br>
> <br>
> <br>
> First steps directly related to this NEP<br>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>
> <br>
> The changes required in NumPy are large and touch many<br>
> areas<br>
> of the code base,<br>
> but many of these changes can be addressed incrementally.<br>
> <br>
> To enable an incremental approach we will start by creating a C<br>
> defined<br>
> ``PyArray_DTypeMeta`` class with its instances being the ``DType``<br>
> classes,<br>
> subclasses of ``np.dtype``.<br>
> This is necessary to add the ability to store custom slots on the<br>
> DType in C.<br>
> This ``DTypeMeta`` will be implemented first to then enable<br>
> incremental<br>
> restructuring of current code.<br>
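> <br>
> In pure-Python terms this is the familiar metaclass pattern (shown only<br>
> as an analogy; the real ``PyArray_DTypeMeta`` is written in C and the<br>
> names below are placeholders)::<br>
> <br>
> class DTypeMeta(type):<br>
>     """Stand-in for the C-level ``PyArray_DTypeMeta``."""<br>
>     # custom C-level slots would be stored per DType class here<br>
> <br>
> class DType(metaclass=DTypeMeta):<br>
>     """Stand-in for ``np.dtype`` itself."""<br>
> <br>
> class Float64DType(DType):<br>
>     """One such subclass would exist per current type number."""<br>
> <br>
> >>> isinstance(Float64DType, DTypeMeta)  # DType classes are metaclass instances<br>
> True<br>
> >>> issubclass(Float64DType, DType)      # and subclasses of the base DType<br>
> True<br>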
> <br>
> The addition of ``DType`` will then enable addressing other changes<br>
> incrementally, some of which may begin before settling the full<br>
> internal<br>
> API:<br>
> <br>
> 1. New machinery for array coercion, with the goal of enabling user<br>
> DTypes<br>
> with appropriate class methods.<br>
> 2. The replacement or wrapping of the current casting machinery.<br>
> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots<br>
> into<br>
> DType method slots.<br>
> <br>
> At this point, no or only very limited new public API will be added<br>
> and<br>
> the internal API is considered to be in flux.<br>
> Any new public API may be set up to give warnings and will have leading<br>
> underscores<br>
> to indicate that it is not finalized and can be changed without<br>
> warning.<br>
> <br>
> <br>
> Backward compatibility<br>
> ----------------------<br>
> <br>
> While the actual backward compatibility impact of implementing Phase<br>
> I and II<br>
> is not yet fully clear, we anticipate and accept the following<br>
> changes:<br>
> <br>
> * **Python API**:<br>
> <br>
> * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,<br>
> while right<br>
> now ``type(np.dtype("f8")) is np.dtype``.<br>
> Code should use ``isinstance`` checks, and in very rare cases may<br>
> have to<br>
> be adapted to use it.<br>
> <br>
> * **C-API**:<br>
> <br>
> * In old versions of NumPy ``PyArray_DescrCheck`` is a macro<br>
> which uses<br>
> ``type(dtype) is np.dtype``. When compiling against an old<br>
> NumPy version,<br>
> the macro may have to be replaced with the corresponding<br>
> ``PyObject_IsInstance`` call. (If this is a problem, we could<br>
> backport<br>
> fixing the macro)<br>
> <br>
> * The UFunc machinery changes will break *limited* parts of the<br>
> current<br>
> implementation. Replacing e.g. the default ``TypeResolver`` is<br>
> expected<br>
> to remain supported for a time, although optimized masked inner<br>
> loop iteration<br>
> (which is not even used *within* NumPy) will no longer be<br>
> supported.<br>
> <br>
> * All functions currently defined on the dtypes, such as<br>
> ``PyArray_Descr->f->nonzero``, will be defined and accessed<br>
> differently.<br>
> This means that in the long run low-level access code will<br>
> have to be changed to use the new API. Such changes are expected<br>
> to be<br>
> necessary in very few projects.<br>
> <br>
> * **dtype implementors (C-API)**:<br>
> <br>
> * The array which is currently provided to some functions (such as<br>
> cast functions)<br>
> will no longer be provided.<br>
> For example, ``PyArray_Descr->f->nonzero`` or ``PyArray_Descr->f-<br>
> >copyswapn``<br>
> may instead receive a dummy array object with only some fields<br>
> (mainly the<br>
> dtype) being valid.<br>
> At least in some code paths, a similar mechanism is already used.<br>
> <br>
> * The ``scalarkind`` slot and registration of scalar casting will<br>
> be<br>
> removed/ignored without replacement.<br>
> It currently allows partial value-based casting.<br>
> The ``PyArray_ScalarKind`` function will continue to work for<br>
> builtin types,<br>
> but will not be used internally and will be deprecated.<br>
> <br>
> * Currently user dtypes are defined as instances of ``np.dtype``.<br>
> The creation works by the user providing a prototype instance.<br>
> NumPy will need to modify at least the type during registration.<br>
> This has no effect for either ``rational`` or ``quaternion`` and<br>
> mutation<br>
> of the structure seems unlikely after registration.<br>
> <br>
> Since there is a fairly large API surface concerning datatypes,<br>
> further changes,<br>
> or the limitation of certain functions to currently existing datatypes,<br>
> are likely to occur.<br>
> For example functions which use the type number as input<br>
> should be replaced with functions taking DType classes instead.<br>
> Although public, large parts of this C-API seem to be used rarely,<br>
> possibly never, by downstream projects.<br>
> <br>
> <br>
> <br>
> Detailed Description<br>
> --------------------<br>
> <br>
> This section details the design decisions covered by this NEP.<br>
> The subsections correspond to the list of design choices presented<br>
> in the Scope section.<br>
> <br>
> Datatypes as Python Classes (1)<br>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>
> <br>
> The current NumPy datatypes are not full-scale Python classes.<br>
> They are instead (prototype) instances of a single ``np.dtype``<br>
> class.<br>
> Changing this means that any special handling, e.g. for ``datetime``<br>
> can be moved to the Datetime DType class instead, away from<br>
> monolithic general<br>
> code (e.g. current ``PyArray_AdjustFlexibleDType``).<br>
> <br>
> The main consequence of this change with respect to the API is that<br>
> special methods move from the dtype instances to methods on the new<br>
> DType class.<br>
> This is the typical design pattern used in Python.<br>
> Organizing these methods and information in a more Pythonic way<br>
> provides a<br>
> solid foundation for refining and extending the API in the future.<br>
> The current API cannot be extended due to how it is exposed<br>
> publicly.<br>
> This means for example that the methods currently stored in<br>
> ``PyArray_ArrFuncs``<br>
> on each datatype (see NEP 40) will be defined differently in the<br>
> future and<br>
> deprecated in the long run.<br>
> <br>
> The most prominent visible side effect of this will be that<br>
> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.<br>
> Instead it will be a subclass of ``np.dtype`` meaning that<br>
> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.<br>
> This will also add the ability to use ``isinstance(dtype,<br>
> np.dtype[float64])``,<br>
> thus removing the need to use ``dtype.kind``, ``dtype.char``, or<br>
> ``dtype.type``<br>
> to do this check.<br>
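> <br>
> As a sketch of the intended idiom (the exact spelling of the float64<br>
> DType class is not fixed by this NEP)::<br>
> <br>
> >>> dt = np.dtype("float64")<br>
> >>> dt.kind == "f" and dt.itemsize == 8   # current style of check<br>
> True<br>
> >>> isinstance(dt, np.dtype[np.float64])  # proposed style of check<br>
> True<br>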
> <br>
> With the design decision of DTypes as full-scale Python classes,<br>
> the question of subclassing arises.<br>
> Inheritance, however, appears problematic and a complexity best<br>
> avoided<br>
> (at least initially) for container datatypes.<br>
> Further, subclasses may be more interesting for interoperability, for<br>
> example with GPU backends (CuPy) storing additional methods related<br>
> to the<br>
> GPU, rather than as a mechanism to define new datatypes.<br>
> A class hierarchy does provide value; this may be achieved by<br>
> allowing the creation of *abstract* datatypes.<br>
> An example for an abstract datatype would be the datatype equivalent<br>
> of<br>
> ``np.floating``, representing any floating point number.<br>
> These can serve the same purpose as Python's abstract base classes.<br>
> <br>
> <br>
> Scalars should not be instances of the datatypes (2)<br>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>
> <br>
> For simple datatypes such as ``float64`` (see also below), it seems<br>
> tempting that the instance of a ``np.dtype("float64")`` can be the<br>
> scalar.<br>
> This idea may be even more appealing due to the fact that scalars,<br>
> rather than datatypes, currently define a useful type hierarchy.<br>
> <br>
> However, we have specifically decided against this for a number of<br>
> reasons.<br>
> First, the new datatypes described herein would be instances of DType<br>
> classes.<br>
> Making these instances themselves classes, while possible, adds<br>
> additional<br>
> complexity that users need to understand.<br>
> It would also mean that scalars must have storage information (such<br>
> as byteorder)<br>
> which is generally unnecessary and currently is not used.<br>
> Second, while the simple NumPy scalars such as ``float64`` may be<br>
> such instances,<br>
> it should be possible to create datatypes for Python objects without<br>
> enforcing<br>
> NumPy as a dependency.<br>
> However, Python objects that do not depend on NumPy cannot be<br>
> instances of a NumPy DType.<br>
> Third, there is a mismatch between the methods and attributes which<br>
> are useful<br>
> for scalars and datatypes. For instance ``to_float()`` makes sense<br>
> for a scalar<br>
> but not for a datatype and ``newbyteorder`` is not useful on a scalar<br>
> (or has<br>
> a different meaning).<br>
> <br>
> Overall, it seems that rather than reducing the complexity by merging<br>
> the two distinct type hierarchies, making scalars instances of DTypes<br>
> would<br>
> increase the complexity of both the design and the implementation.<br>
> <br>
> A possible future path may be to instead turn the current NumPy<br>
> scalars into<br>
> much simpler objects which largely derive their behaviour from the<br>
> datatypes.<br>
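> <br>
> To illustrate the distinction as it exists today (and as it will be<br>
> preserved)::<br>
> <br>
> >>> x = np.float64(1.0)        # a scalar, holding a value<br>
> >>> dt = np.dtype(np.float64)  # the datatype, describing array storage<br>
> >>> isinstance(x, np.dtype)    # scalars are not dtype instances ...<br>
> False<br>
> >>> isinstance(x, np.floating) # ... but have their own type hierarchy<br>
> True<br>
> >>> dt.type is np.float64      # the dtype merely points at the scalar type<br>
> True<br>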
> <br>
> C-API for creating new Datatypes (3)<br>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>
> <br>
> The current C-API with which users can create new datatypes<br>
> is limited in scope, and requires use of "private" structures. This<br>
> means<br>
> the API is not extensible: no new members can be added to the<br>
> structure<br>
> without losing binary compatibility.<br>
> This has already limited the inclusion of new sorting methods into<br>
> NumPy [new_sort]_.<br>
> <br>
> The new version shall thus replace the current ``PyArray_ArrFuncs``<br>
> structure used<br>
> to define new datatypes.<br>
> Datatypes that currently exist and are defined using these slots will<br>
> be<br>
> supported during a deprecation period.<br>
> <br>
> The most likely solution to hide the implementation from the user,<br>
> and thus make<br>
> it extensible in the future, is to model the API after Python's stable<br>
> API [PEP-384]_:<br>
> <br>
> .. code-block:: C<br>
> <br>
> static struct PyArrayMethodDef slots[] = {<br>
> {NPY_dt_method, method_implementation},<br>
> ...,<br>
> {0, NULL}<br>
> };<br>
> <br>
> typedef struct{<br>
> PyTypeObject *typeobj; /* type of python scalar */<br>
> ...;<br>
> PyType_Slot *slots;<br>
> } PyArrayDTypeMeta_Spec;<br>
> <br>
> PyObject* PyArray_InitDTypeMetaFromSpec(<br>
> PyArray_DTypeMeta *user_dtype, PyArrayDTypeMeta_Spec<br>
> *dtype_spec);<br>
> <br>
> The C-side slots should be designed to mirror Python side methods<br>
> such as ``dtype.__dtype_method__``, although the exposure to Python<br>
> is<br>
> a later step in the implementation to reduce the complexity of the<br>
> initial<br>
> implementation.<br>
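> <br>
> As a loose Python-side analogy (none of these names are proposed API,<br>
> and the Python exposure itself is a later step), the spec-and-slots<br>
> registration above plays the role that special methods play on an<br>
> ordinary class::<br>
> <br>
> class CategoricalDType:                          # hypothetical user DType<br>
>     def __dtype_common_dtype__(self, other):     # placeholder slot name<br>
>         return NotImplemented<br>
> <br>
>     def __dtype_can_cast__(self, other, casting="safe"):  # placeholder<br>
>         return False<br>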
> <br>
> <br>
> C-API Changes to the UFunc Machinery (4)<br>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>
> <br>
> Proposed changes to the UFunc machinery will be part of NEP 43.<br>
> However, the following changes will be necessary (see NEP 40 for a<br>
> detailed<br>
> description of the current implementation and its issues):<br>
> <br>
> * The current UFunc type resolution must be adapted to allow better<br>
> control<br>
> for user-defined dtypes as well as resolve current inconsistencies.<br>
> * The inner-loop used in UFuncs must be expanded to include a return<br>
> value.<br>
> Further, error reporting must be improved, and passing in dtype-<br>
> specific<br>
> information enabled.<br>
> This requires the modification of the inner-loop function signature<br>
> and<br>
> addition of new hooks called before and after the inner-loop is<br>
> used.<br>
> <br>
> An important goal for any changes to the universal functions will be<br>
> to<br>
> allow the reuse of existing loops.<br>
> It should be easy for a new units datatype to fall back to existing<br>
> math<br>
> functions after handling the unit-related computations.<br>
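> <br>
> A rough, runnable illustration of this kind of reuse (the function and<br>
> its signature are purely hypothetical)::<br>
> <br>
> def add_with_units(a, a_unit, b, b_unit):<br>
>     # the units-aware part only handles the unit logic ...<br>
>     if a_unit != b_unit:<br>
>         raise TypeError(f"cannot add {a_unit} and {b_unit}")<br>
>     # ... and reuses the existing numerical loop for the actual data<br>
>     return np.add(a, b), a_unit<br>
> <br>
> >>> add_with_units(np.array([1.0, 2.0]), "m", np.array([3.0, 4.0]), "m")<br>
> (array([4., 6.]), 'm')<br>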
> <br>
> <br>
> Discussion<br>
> ----------<br>
> <br>
> See NEP 40 for a list of previous meetings and discussions.<br>
> <br>
> <br>
> References<br>
> ----------<br>
> <br>
> .. [pandas_extension_arrays] <br>
> <a href="https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types" rel="noreferrer" target="_blank">https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types</a><br>
> <br>
> .. [xarray_dtype_issue] <a href="https://github.com/pydata/xarray/issues/1262" rel="noreferrer" target="_blank">https://github.com/pydata/xarray/issues/1262</a><br>
> <br>
> .. [pygeos] <a href="https://github.com/caspervdw/pygeos" rel="noreferrer" target="_blank">https://github.com/caspervdw/pygeos</a><br>
> <br>
> .. [new_sort] <a href="https://github.com/numpy/numpy/pull/12945" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/12945</a><br>
> <br>
> .. [PEP-384] <a href="https://www.python.org/dev/peps/pep-0384/" rel="noreferrer" target="_blank">https://www.python.org/dev/peps/pep-0384/</a><br>
> <br>
> .. [PR 15508] <a href="https://github.com/numpy/numpy/pull/15508" rel="noreferrer" target="_blank">https://github.com/numpy/numpy/pull/15508</a><br>
> <br>
> <br>
> Copyright<br>
> ---------<br>
> <br>
> This document has been placed in the public domain.<br>
> <br>
> <br>
> Acknowledgments<br>
> ---------------<br>
> <br>
> The effort to create new datatypes for NumPy has been discussed for<br>
> several<br>
> years in many different contexts and settings, making it impossible<br>
> to list everyone involved.<br>
> We would like to thank especially Stephan Hoyer, Nathaniel Smith, and<br>
> Eric Wieser<br>
> for repeated in-depth discussion about datatype design.<br>
> We are very grateful for the community input in reviewing and<br>
> revising this<br>
> NEP and would like to thank especially Ross Barnowski and Ralf<br>
> Gommers.<br>
> <br>
<br>
_______________________________________________<br>
NumPy-Discussion mailing list<br>
<a href="mailto:NumPy-Discussion@python.org" target="_blank">NumPy-Discussion@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/numpy-discussion" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/numpy-discussion</a><br>
</blockquote></div></div>