Mailman 3 New user dtypes and the buffer protocol - NumPy-Discussion

July 6, 2023

Hi all,

As you may know, I'm currently working on a variable-width string dtype
using the new experimental user dtype API. As part of this work I'm running
into papercuts that future dtype authors will likely hit and I've been
trying to fix them as I go.

One issue I'd like to raise with the list is that the Python buffer
protocol and the `__array_interface__` protocol support a limited set of
data types.

This leads to three concrete issues I'm working around:

   * The `npy` file format uses the type strings defined by the
`__array_interface__` protocol, so any dtype that doesn't have a type string
defined in that protocol cannot currently be saved [1].

   * Cython relies on the buffer protocol both for its NumPy array support
and for typed memoryviews, so any array whose dtype doesn't support the
buffer protocol cannot be accessed using idiomatic Cython code [2]. The same
limitation means Cython can't easily support float16 or datetime dtypes [3].

   * New dtypes currently have no way to export a string representation of
themselves that NumPy can later load (implicitly importing the dtype). This
makes it more awkward to update downstream libraries that currently treat
dtypes as strings.

One way to fix this is to define an ad-hoc extension to the buffer
protocol. Officially, the buffer protocol only supports the format codes
used in the struct module [4]. Unofficially, memoryview doesn't raise a
NotImplementedError if you pass it an invalid format code, only raising an
error when it tries to access the data. This means we can stuff an
arbitrary string into the format code. See the proposal from Sebastian on
the Python Discuss forum [5] and his proof-of-concept [6]. The hardest
part of this approach is social rather than technical: it requires
cross-project coordination with at least Cython, and possibly a PEP to
standardize whatever extension to the buffer protocol we come up with.
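
The memoryview behavior described above can be seen without NumPy at all.
This is only an illustration using ctypes, which already exports format
strings that the struct module does not understand; it is not Sebastian's
proof-of-concept, and the `Point` structure is made up for the example.

```python
import ctypes


class Point(ctypes.Structure):
    # Hypothetical record type, only here to get a non-struct-module format.
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


points = (Point * 3)()
points[1].x = 7

# Creating the view succeeds even though memoryview cannot interpret
# the exporter's format string.
view = memoryview(points)
fmt = view.format

# Raw byte access does not need to understand the format either.
raw = view.tobytes()

# Only element access forces memoryview to decode the format.
try:
    view[1]
    access_raised = False
except NotImplementedError:
    access_raised = True
```

So the format field already works as an opaque string that consumers are
free to ignore or to parse with their own rules, which is the gap an ad-hoc
extension would rely on.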

Another option would be to exchange data using the Arrow data format [7],
which already supports many of the memory layouts custom dtype authors
might want to use and supports defining custom data types [8]. The big
issue here is that NumPy probably can't depend on the Arrow C++ library
(I think?), so we would need to write a bunch of code to support Arrow data
layouts and data types, and then do the same thing on the Cython side.
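
For a sense of what "supporting an Arrow layout" involves, here is a minimal
pure-Python sketch of Arrow's variable-size binary layout as I understand it
from the columnar spec [7]: an int32 offsets buffer with one more entry than
there are values, plus one contiguous data buffer. It leaves out the
validity bitmap and buffer padding, and the function names are made up for
this example rather than taken from any Arrow library.

```python
import struct


def encode_utf8_column(values):
    """Pack strings into (offsets buffer, data buffer), little-endian int32."""
    offsets = [0]
    chunks = []
    for text in values:
        encoded = text.encode("utf-8")
        chunks.append(encoded)
        # Each offset marks where the next value starts in the data buffer.
        offsets.append(offsets[-1] + len(encoded))
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, b"".join(chunks)


def decode_utf8_column(offsets_buf, data_buf):
    """Recover the list of strings from the two buffers."""
    count = len(offsets_buf) // 4
    offsets = struct.unpack(f"<{count}i", offsets_buf)
    return [
        data_buf[start:stop].decode("utf-8")
        for start, stop in zip(offsets, offsets[1:])
    ]


offsets_buf, data_buf = encode_utf8_column(["numpy", "", "dtype"])
round_trip = decode_utf8_column(offsets_buf, data_buf)
```

The layout itself is simple. The cost is in everything around it (nulls,
extension type metadata, the C data interface), on both the NumPy and Cython
sides.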

Implementing either of these approaches fixes the issues I enumerated above
at the cost of some added complexity. We don't necessarily have to make an
immediate decision for my work to be viable, since I can work around most of
these issues. Still, I think now is the time to raise this and see if
anyone has strong opinions about what NumPy should ultimately do.

I've raised this on the Cython mailing list to get their take as well [9].

[1] https://github.com/numpy/numpy/issues/24110
[2] https://github.com/numpy/numpy/issues/18442
[3] https://github.com/numpy/numpy/issues/4983
[4] https://docs.python.org/3/library/struct.html#format-strings
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[6] https://github.com/numpy/numpy/issues/23500#issuecomment-1525103546
[7] https://arrow.apache.org/docs/format/Columnar.html
[8] https://arrow.apache.org/docs/format/Columnar.html#extension-types
[9] https://mail.python.org/pipermail/cython-devel/2023-July/005434.html

Nathan