[Cython] Appetite for working with upstream to extend the buffer protocol?

Nathan nathan.goldbaum at gmail.com
Thu Jul 6 12:43:44 EDT 2023


Hi all,

I'm working on a new data type for NumPy to represent arrays of
variable-width strings [1]. One limitation of using this data type right
now is that it's not possible to write idiomatic Cython code operating on
the array; instead one would need to use e.g. the NumPy iterator API. It
turns out this is a papercut that's been around for a while, and it is
most noticeable downstream because datetime arrays cannot be passed to
Cython.
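
To make the papercut concrete, here's a minimal demonstration (the exact
error message may vary between NumPy versions): datetime64 arrays simply
refuse to export a buffer, so neither memoryview nor a Cython typed
memoryview can be built from them.

    import numpy as np

    arr = np.array(["2023-07-06", "2023-07-07"], dtype="datetime64[D]")
    # NumPy refuses to export datetime64 through the buffer protocol, so
    # this raises ValueError ("cannot include dtype 'M' in a buffer"):
    memoryview(arr)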

Here's an example of a downstream library working around the lack of
support in Cython for datetimes by using an iterator: [2]. Pandas works
around this by passing int64 views of the arrays to Cython (see the sketch
below). I think this issue will become more problematic in the future once
NumPy officially ships the NEP 42 custom dtype API, which will make it much
easier to develop custom data types. It is also already an issue for the
legacy custom data types NumPy supports [3], but those aren't very popular,
so it hasn't come up much.
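
For completeness, here's a sketch of that int64-view style of workaround
(my own minimal example, not code taken from pandas):

    import numpy as np

    arr = np.array(["2023-07-06", "2023-07-07"], dtype="datetime64[D]")
    # Reinterpret the same memory as int64 (days since the epoch here);
    # int64 has a struct-module format code, so the buffer protocol works:
    as_ints = arr.view("int64")
    mv = memoryview(as_ints)  # fine; a Cython int64_t[:] memoryview works too
    # The datetime semantics (unit, NaT handling) are lost at the boundary
    # and have to be re-applied by hand on the Cython side.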

I'm curious if there's any appetite among the Cython developers to
ultimately make it easier to write Cython code that works with NumPy arrays
that have user-defined data types. Currently it's only possible to write
code using the NumPy or typed memoryview interfaces for arrays with data
types that support the buffer protocol. See e.g.
https://github.com/numpy/numpy/issues/4983.

One approach to fix this would be to either officially or unofficially
extend the buffer protocol to allow arbitrary typecodes to be sent in the
format string. Officially, Python only supports the format codes used in
the struct module, but in practice you can put any string in the format
field and memoryview will accept it (see the sketch below). Of course, for
this to actually be useful, NumPy would need to define format codes that
allow Cython to correctly read and reconstruct the type.
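
As a proof of concept, here's a minimal Cython sketch of what I mean (the
class and the format code are invented for illustration, not an actual
proposal): a buffer exporter that puts a non-struct typecode into
Py_buffer.format, which memoryview happily stores even though it can't
interpret the items.

    # cython: language_level=3
    from cpython.buffer cimport PyBUF_FORMAT

    cdef class CustomExporter:
        # Stand-in payload; a real custom dtype would expose its own storage.
        cdef double data[4]
        cdef Py_ssize_t shape[1]
        cdef Py_ssize_t strides[1]

        def __cinit__(self):
            for i in range(4):
                self.data[i] = i
            self.shape[0] = 4
            self.strides[0] = sizeof(double)

        def __getbuffer__(self, Py_buffer *view, int flags):
            view.buf = <void *> &self.data[0]
            view.obj = self
            view.len = 4 * sizeof(double)
            view.itemsize = sizeof(double)
            view.ndim = 1
            view.shape = self.shape
            view.strides = self.strides
            view.suboffsets = NULL
            view.readonly = 0
            view.internal = NULL
            if flags & PyBUF_FORMAT:
                # Not a struct-module code, but memoryview() stores it verbatim.
                view.format = b"MyCustomDType"
            else:
                view.format = NULL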

Sebastian Berg proposed this on the CPython discussion forum [4] and there
hasn't been much response from upstream. In response to Sebastian, Michael
Droettboom suggested [5] using the Arrow data format, which has rich
support for various array memory layouts and supports exchanging custom
extension types [6].

The main problem with the buffer protocol approach is defining the protocol
in such a way that Cython can correctly reconstruct the memory layout of an
arbitrary user-defined data type, ideally without needing to import any
code defining that data type (although only supporting strided arrays at
first makes a lot of sense).
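
To make that concrete, here's what a consumer can already recover today
from the toy exporter sketched above (again, CustomExporter and its format
code are invented for illustration): the strided layout is fully described
by the existing Py_buffer fields; what's missing is a convention for
identifying and interpreting the element type without importing the package
that defines it.

    mv = memoryview(CustomExporter())
    print(mv.format)                  # 'MyCustomDType', opaque to struct/Cython
    print(mv.itemsize, mv.shape, mv.strides)  # layout information is all there
    mv[0]  # raises: the format string isn't a struct-module code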

The main problem with the Apache Arrow approach is that neither Cython nor
NumPy has any support for it, and I don't think either library can depend
on Arrow, so both would need to write custom serializers and parsers,
whereas Cython already has memoryviews fully working.

Guido van Rossum wanted some more discussion about this, so I'm raising it
here in case any Cython developers are interested. Please chime in on the
Python Discourse thread [4] if so.

-Nathan

[1] https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
[2] https://github.com/scikit-hep/awkward/issues/367
[3] https://github.com/numpy/numpy/issues/18442
[4] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256/3
[6] https://arrow.apache.org/docs/format/Columnar.html