[Numpy-discussion] DType Roadmap/NEP Discussion

Thu Sep 19 00:33:56 EDT 2019

Hi Sebastian,

On Wed, Sep 18, 2019 at 4:35 PM Sebastian Berg <sebastian at sipsolutions.net>
wrote:

> Hi all,
>
> To try to make some progress towards a decision, since the broad design
> is pretty much settling from my side, I am thinking about setting up a
> meeting, and suggest Monday at 11am Pacific Time (I am open to other
> times though).
>
> My hope is to get everyone interested on board, so that we can make an
> informed decision about the general direction very soon. So just reach
> out, or discuss on the mailing list as well.
>
> The current draft for an NEP is here:
> https://hackmd.io/kxuh15QGSjueEKft5SaMug?both
>
> There are some design goals that I would like to clear up.

The design itself seems very sensible to me insofar as I understand it.
After having read your document again, I think you're still missing the
actual goals though. "structure of class layout" and "type hierarchy" are
important, but they're not the goals. You're touching on the real goals in
places, but it may be valuable to be much more explicit there.

Here are some example goals:

1. Make creating new dtypes via the NumPy C API take >4x fewer lines of code
on average (in practice: for rational/quaternion, hard to measure
otherwise).

2. Make it possible to create new dtypes with full functionality via the
NumPy Python API. Performance may be up to 1-2 orders of magnitude worse
than when creating the same dtype via the C API; the main purpose is to
allow easier prototyping of new dtypes.

3. Make the NumPy codebase more maintainable by removing special-casing of
datetime dtypes in many places.

4. Enable creation of a units library whose arrays *are* numpy arrays
rather than a subclass or duck array. This will make such a library work
much better with SciPy and other existing libraries that use np.asarray
extensively.

5. Hide currently exposed implementation details in the C API so long-term
.... (you have this one, but it would be nice to work it out a little more
- after all we recently considered reverting the deprecation for direct
field access, so how important is this?)

6. Improve casting behavior for external dtypes

7. Make np.char behavior better <in ... ways> (you mention fixed length
strings work poorly now, but not what would change)
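
To make goal 4 concrete, here is a minimal sketch of why a subclass-based
units array works poorly with code that calls np.asarray. The `UnitArray`
class and its `unit` attribute are hypothetical, made up only for this
illustration; they are not taken from any existing units library.

```python
import numpy as np


class UnitArray(np.ndarray):
    """Hypothetical ndarray subclass that carries a unit string."""

    def __new__(cls, data, unit=""):
        obj = np.asarray(data, dtype=float).view(cls)
        obj.unit = unit
        return obj

    def __array_finalize__(self, obj):
        # Propagate the unit through views and slices.
        self.unit = getattr(obj, "unit", "")


lengths = UnitArray([1.0, 2.0, 3.0], unit="m")

# Many library functions start with np.asarray, which returns a base-class
# ndarray, so the unit is lost from that point on.
plain = np.asarray(lengths)

print(type(lengths).__name__, lengths.unit)
print(type(plain).__name__, hasattr(plain, "unit"))
```

If the unit lived in the dtype instead of in a subclass, np.asarray would
presumably have nothing to strip, which is the point of the goal as stated.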

Listing non-goals would also be useful:

1. Performance: no significant performance improvements are expected. We
aim for no performance regressions.

2. Introducing new dtypes into NumPy itself

3. Pandas ExtensionArrays? You mention them, but does this dtype redesign
help Pandas in any way or not?

4. Changes to NumPy's current casting rules

5. Allow creation of dtypes that don't fit the current NumPy model of what
a dtype is (e.g. ref [1]), such as a variable-length string dtype.
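
For context on non-goal 5 (and on goal 7), a small sketch of what "fixed
length" means for current NumPy strings. It only shows existing behavior and
says nothing about what the redesign would change.

```python
import numpy as np

# The width of the string dtype is fixed when the array is created,
# based on the longest element.
names = np.array(["ab", "abc"])
print(names.dtype)
print(names.dtype.itemsize)  # bytes per element: 3 characters x 4 bytes

# Assigning a longer string does not grow the element; it is silently
# truncated to the fixed width.
names[0] = "abcdef"
print(names[0])
```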

Many of those (and there can be more, this is just what came to mind now)
can/should be a paragraph or section. In my experience describing these
goals and requirements well takes about 15-30% of the length of the design
description. Think of, for example, a Pandas or units library maintainer
reading this: they should be able to stop reading where you now have
"Overview Graphic" and have a pretty clear high-level understanding of what
this whole redesign will mean for them. Same for a NumPy maintainer who
wants to get a sense of what the benefits and impacts will be: reading only
(the expanded version of) your Abstract, Motivation and Scope, and
Backwards Compatibility sections should be enough.

Here's a concrete question, the type of thing I'd like to be able to answer
without having to understand the whole design in detail:
```
>>> import datetime
>>> import numpy as np
>>> import pandas as pd
>>> dti = pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01'),
...                       datetime.datetime(2018, 1, 1)])
>>> dti.values
array(['2018-01-01T00:00:00.000000000', '2018-01-01T00:00:00.000000000',
       '2018-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> dti.values.dtype
dtype('<M8[ns]')
>>> isinstance(dti.values.dtype, np.dtype)
True
>>> dti.dtype == dti.values.dtype      # okay, that's nice
True

>>> start = pd.to_datetime('2015-02-24')
>>> rng = pd.date_range(start, periods=3)
>>> t = pd.Series(rng)
>>> t_withzone = t.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
>>> t_withzone
0   2015-02-24 05:30:00+05:30
1   2015-02-25 05:30:00+05:30
2   2015-02-26 05:30:00+05:30
dtype: datetime64[ns, Asia/Kolkata]
>>> t_withzone.dtype
datetime64[ns, Asia/Kolkata]
>>> t_withzone.values.dtype
dtype('<M8[ns]')
>>> t_withzone.dtype == t_withzone.values.dtype    # could this be True in the future?
False
```
So can Pandas create timezone-aware numpy dtypes in the future if they want
to, or would they still be better off rolling their own?

Also one question/comment about the design content. When looking at the
current external dtypes (e.g. [2]), a large part of the work of
implementing a new dtype now deals with ufunc behavior. It's not clear from
your document how that changes with the new design; can you add something
about that?
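
As background for that question, a short sketch of how ufuncs are tied to
dtypes today: each ufunc carries a table of type-specific inner loops, which
is why an external dtype has to register its own. This only inspects the
current behavior and assumes nothing about how the new design would handle
registration.

```python
import numpy as np

# Each entry is a loop signature in dtype character codes,
# written as "inputs->outputs".
print(np.add.types)

# float64 + float64 -> float64 is one of the built-in loops.
print("dd->d" in np.add.types)

# datetime64 + timedelta64 works because np.add has a loop for that
# combination as well.
next_day = np.datetime64("2018-01-01") + np.timedelta64(1, "D")
print(next_day)
```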

Cheers,
Ralf

[1] http://scipy-lectures.org/advanced/advanced_numpy/index.html#the-descriptor
[2] https://github.com/moble/quaternion/blob/master/numpy_quaternion.c