New user dtypes and the buffer protocol
Hi all,

As you may know, I'm currently working on a variable-width string dtype using the new experimental user dtype API. As part of this work I'm running into papercuts that future dtype authors will likely hit, and I've been trying to fix them as I go.

One issue I'd like to raise with the list is that the Python buffer protocol and the `__array_interface__` protocol support a limited set of data types. This leads to three concrete issues I'm working around:

* The `npy` file format uses the type strings defined by the `__array_interface__` protocol, so any type that doesn't have a type string defined in that protocol cannot currently be saved [1].
* Cython uses the buffer protocol in its support for NumPy arrays and in the typed memoryview interface, so any array with a dtype that doesn't support the buffer protocol cannot be accessed using idiomatic Cython code [2]. The same issue means Cython can't easily support float16 or datetime dtypes [3].
* Currently new dtypes don't have a way to export a string version of themselves that NumPy can subsequently load (implicitly importing the dtype). This makes it more awkward to update downstream libraries that currently treat dtypes as strings.

One way to fix this is to define an ad-hoc extension to the buffer protocol. Officially, the buffer protocol only supports the format codes used in the struct module [4]. Unofficially, memoryview doesn't raise a NotImplementedError if you pass it an invalid format code, only raising an error when it tries to access the data. This means we can stuff an arbitrary string into the format code. See the proposal from Sebastian on the Python Discuss forum [5] and his proof-of-concept [6]. The hardest issue with this approach is that it's a social problem, requiring cross-project coordination with at least Cython, and possibly a PEP to standardize whatever extension to the buffer protocol we come up with.
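The memoryview behaviour described above can be reproduced with the standard library alone. This is only a minimal sketch, assuming a ctypes structure as a stand-in exporter: ctypes advertises a PEP 3118 struct-style format string that the struct module does not understand, which is similar in spirit to (but not the same as) the custom format strings discussed in [5] and [6]. The `Point` type and the `probe_buffer` helper are hypothetical names introduced for illustration.

```python
import ctypes
import struct


class Point(ctypes.Structure):
    # Hypothetical record type, used only to obtain a non-struct format code.
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


def probe_buffer(obj):
    """Report how memoryview treats the format string exported by obj."""
    view = memoryview(obj)  # creating the view never validates the format
    report = {"format": view.format, "nbytes": view.nbytes}

    # Does the struct module recognise the exported format code?
    try:
        struct.calcsize(view.format)
        report["struct_understands_format"] = True
    except struct.error:
        report["struct_understands_format"] = False

    # Element access is the point where memoryview has to interpret the format.
    try:
        view[0]
        report["element_access"] = "ok"
    except NotImplementedError:
        report["element_access"] = "NotImplementedError"

    # The raw bytes remain reachable regardless of the format string.
    report["raw_bytes"] = view.tobytes()
    return report


points = (Point * 2)(Point(1, 2), Point(3, 4))
struct_report = probe_buffer(points)
bytes_report = probe_buffer(b"abc")
```

For the ctypes array, creating the view and copying out the raw bytes both work, and only indexing fails. That gap is what would let an exporter place an arbitrary dtype description in the format field, provided consumers such as Cython agree on how to parse it.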
Another option would be to exchange data using the arrow data format [7], which already supports many of the kinds of memory layouts custom dtype authors might want to use and supports defining custom data types [8]. The big issue here is that NumPy probably can't depend on the arrow C++ library (I think?), so we would need to write a bunch of code to support arrow data layouts and data types, and then we would need to do the same thing on the Cython side.

Implementing either of these approaches fixes the issues I enumerated above at the cost of some added complexity. We don't necessarily have to make an immediate decision for my work to be viable; I can work around most of these issues. Still, I think now is probably the time to raise this as an issue and see if anyone has strong opinions about what NumPy should ultimately do.

I've raised this on the Cython mailing list to get their take as well [9].

[1] https://github.com/numpy/numpy/issues/24110
[2] https://github.com/numpy/numpy/issues/18442
[3] https://github.com/numpy/numpy/issues/4983
[4] https://docs.python.org/3/library/struct.html#format-strings
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[6] https://github.com/numpy/numpy/issues/23500#issuecomment-1525103546
[7] https://arrow.apache.org/docs/format/Columnar.html
[8] https://arrow.apache.org/docs/format/Columnar.html#extension-types
[9] https://mail.python.org/pipermail/cython-devel/2023-July/005434.html
I wonder if the dlpack protocol can be helpful for these kinds of dtypes?

On Thu, Jul 6, 2023 at 7:56 PM Nathan <nathan.goldbaum@gmail.com> wrote:
On 6/7/23 20:44, Evgeni Burovski wrote:
I wonder if the dlpack protocol can be helpful for these kinds of dtypes?
No. DLPack has an enum for a fixed number of known dtypes [0], and adding new ones is non-trivial.

[0] https://github.com/dmlc/dlpack/blob/ca4d00ad3e2e0f410eeab3264d21b8a39397f362...

Matti
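To make the limitation concrete, here is a rough Python mirror of the type-code enum. It is a sketch under assumptions: the member values follow `DLDataTypeCode` in `dlpack.h` as it stood around mid-2023, to the best of my understanding, and later DLPack releases may define more codes. The `dlpack_code_for_kind` helper and its mapping from NumPy dtype `kind` characters are hypothetical and exist only for illustration.

```python
from enum import IntEnum


class DLDataTypeCode(IntEnum):
    # Assumed to match dlpack.h around mid-2023; newer headers may add codes.
    kDLInt = 0
    kDLUInt = 1
    kDLFloat = 2
    kDLOpaqueHandle = 3
    kDLBfloat = 4
    kDLComplex = 5
    kDLBool = 6


# Hypothetical mapping from NumPy dtype.kind characters to DLPack type codes.
_KIND_TO_CODE = {
    "i": DLDataTypeCode.kDLInt,
    "u": DLDataTypeCode.kDLUInt,
    "f": DLDataTypeCode.kDLFloat,
    "c": DLDataTypeCode.kDLComplex,
    "b": DLDataTypeCode.kDLBool,
}


def dlpack_code_for_kind(kind):
    """Return the DLPack type code for a dtype kind, or raise TypeError."""
    try:
        return _KIND_TO_CODE[kind]
    except KeyError:
        # Strings ("U"), datetimes ("M"), objects ("O") and user dtypes have
        # no entry in the fixed enum, so they cannot be described.
        raise TypeError(f"no DLPack type code for dtype kind {kind!r}") from None
```

Since a DLPack dtype is a small integer code plus a bit width and lane count, there is no slot for a free-form description of the kind a user-defined dtype would need.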
participants (3)
- Evgeni Burovski
- Matti Picus
- Nathan