Standard for dtype string representation?

Hi,

We are in the process of adopting a standard representation of data types for the forthcoming version of N-dim arrays in C-Blosc2, and we want to use the NumPy string representation for that (see the end of https://github.com/Blosc/c-blosc2/blob/main/README_B2ND_METALAYER.rst). It might seem a bit strange to use the specification of a Python package for that, but given its predominant role in data science, I don't think this should come as a surprise to anyone.

There are some small gotchas though. For simple data types, the string representation is *apparently* fine. E.g.:

In [16]: str(np.dtype("i8"))
Out[16]: 'int64'

However, as soon as we try to represent the endianness of the type, we get:

In [17]: str(np.dtype(">i8"))
Out[17]: '>i8'

So, it uses the short version of the representation. And the same happens with the structured types:

In [22]: str(np.dtype("S1,i8"))
Out[22]: "[('f0', 'S1'), ('f1', '<i8')]"

Finally, the endianness seems to be represented arbitrarily. E.g. in:

In [23]: str(np.dtype("S1"))
Out[23]: '|S1'

note that the '|' char is prefixed to indicate that endianness does not apply, while it does not appear in the structured representation.

While I know that there are some other representations for types in NumPy (e.g. numeric integers via dtype.num), I very much appreciate (and I suppose the same should go for other makers of numerical libraries) the expressiveness of str(dtype), especially when it comes to structured dtypes, were it not for the (relatively small) inconsistencies listed above.

BTW, I have had a quick glance at the Python array API standard effort ( https://data-apis.org/array-api/latest/API_specification/data_types.html#dat...), but it does not seem this is being addressed.

For now (and for the Python-Blosc2 wrapper), we are going in this direction:

    if dtype.kind == 'V':
        repr = str(dtype)
    else:
        repr = dtype.str

Is there a way (or an ongoing effort) to express the variety of data types in NumPy that beats the above (which seems somewhat inconsistent to me)?

Thanks!
-- Francesc Alted
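
The rule above can be sketched as a pair of helper functions that also read the string back. This is only an illustration: the names `dtype_to_string` and `string_to_dtype` are hypothetical and are not part of NumPy or Python-Blosc2, and the decoder assumes that structured dtypes are rendered as a Python literal (list, dict or tuple), as in the outputs shown earlier.

```python
import ast

import numpy as np


def dtype_to_string(dtype):
    """Encode a dtype with the rule quoted above (hypothetical helper)."""
    dtype = np.dtype(dtype)
    # kind 'V' covers structured (void) dtypes; dtype.str would collapse
    # them to an opaque code such as '|V9', so str(dtype) is used instead.
    if dtype.kind == "V":
        return str(dtype)
    # For everything else dtype.str always carries an explicit
    # byte-order character ('<', '>' or '|').
    return dtype.str


def string_to_dtype(text):
    """Decode a string produced by dtype_to_string (hypothetical helper)."""
    # Assumption: structured dtypes print as a Python literal, which
    # ast.literal_eval can parse safely without executing code.
    if text.startswith(("[", "{", "(")):
        return np.dtype(ast.literal_eval(text))
    return np.dtype(text)


record = np.dtype([("f0", "S1"), ("f1", ">i8")])
encoded = dtype_to_string(record)
decoded = string_to_dtype(encoded)
```

With this rule, simple dtypes always carry their byte-order prefix, while structured dtypes keep their field names, at the cost of mixing two representations in the same field.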

On Wed, Feb 8, 2023 at 1:42 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
If you mean the array interface ( https://numpy.org/doc/stable/reference/arrays.interface.html), this is what dtype.str provides ( https://numpy.org/doc/stable/reference/generated/numpy.dtype.str.html). But the limitation here is that structured types are represented by the 'V' char, which does not describe them properly by any means.
-- Francesc Alted

On Wed, 2023-02-08 at 14:31 +0100, Francesc Alted wrote:
Ah, I was thinking of what the Python buffer protocol uses, which is what struct uses: https://docs.python.org/3/library/struct.html#module-struct

That has some annoyances for sure, and structured dtypes with field names need rather strange syntax. Also I think padding bytes at best are simply fields with an empty name. But overall, it probably already does a better job than any `str()` for basic types:

In [2]: import numpy as np

In [3]: np.array(0, dtype="i,i,2f")
Out[3]: array((0, 0, [0., 0.]), dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4', (2,))])

In [4]: memoryview(np.array(0, dtype="i,i,2f")).format
Out[4]: 'T{i:f0:i:f1:(2)f:f2:}'

- Sebastian

On Wed, Feb 8, 2023 at 3:19 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Aha, that's pretty cool, although I don't think this is flexible enough to support e.g. field names or nested fields. After pondering about it, I think we will add a format ID to our spec, and will stick with NumPy as the default. If in the future another format appears that is better defined, we could still change the representation and use a new ID, while keeping backwards compatibility if needed. Thanks!
-- Francesc Alted
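
The format-ID idea can be pictured as a small tagged record plus a table of parsers. This is a minimal sketch only: the key names, the ID value `0` and the helper functions are assumptions made for illustration and are not taken from the C-Blosc2 spec.

```python
# Hypothetical ID for "dtype string follows the NumPy convention".
DTYPE_FORMAT_NUMPY = 0


def pack_dtype_meta(dtype_string, format_id=DTYPE_FORMAT_NUMPY):
    """Store the dtype string together with the ID of the format it uses."""
    return {"dtype_format": format_id, "dtype": dtype_string}


def read_dtype_meta(meta, parsers):
    """Dispatch on the stored format ID; unknown IDs fail loudly."""
    format_id = meta["dtype_format"]
    if format_id not in parsers:
        raise ValueError(f"unknown dtype format ID: {format_id}")
    return parsers[format_id](meta["dtype"])


# A reader registers one parser per format ID it understands. Here the
# parser just tags the string; a real one would build a dtype object.
parsers = {DTYPE_FORMAT_NUMPY: lambda text: ("numpy", text)}
meta = pack_dtype_meta("<i8")
parsed = read_dtype_meta(meta, parsers)
```

Old data keeps its original ID, so a reader that later learns a new format can still parse it by keeping the old parser registered.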

On Wed, 2023-02-08 at 17:08 +0100, Francesc Alted wrote:
It does support field names. I think the main problem may be that it cannot support e.g. datetimes. You probably also can't support empty field names (but maybe that is too weird to be useful anyway).

Note that in the output there `T{i:f0}` denotes that it is structured and the name is "f0". And yes, you can nest another `T{}` inside.

I am not sure whether type codes are limited to single characters, which (to me) would seem a bit limiting. I also think we may have to agree on e.g. an empty name denoting padding.

- Sebastian
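
Both points can be checked with a short script. This is a sketch, assuming a NumPy version that refuses to export `datetime64` through the buffer protocol; `buffer_format` is a hypothetical helper, and the exact format strings may vary with platform and alignment, so none is shown as expected output.

```python
import numpy as np


def buffer_format(dtype):
    """Return the buffer-protocol format string for a dtype, or None."""
    arr = np.zeros((), dtype=dtype)
    try:
        return memoryview(arr).format
    except (ValueError, BufferError, TypeError):
        # NumPy raises when the dtype has no buffer-protocol equivalent.
        return None


# A nested structured dtype: the inner struct becomes a nested T{...}.
nested = np.dtype([("pos", [("x", "i4"), ("y", "i4")]), ("val", "f8")])
nested_format = buffer_format(nested)

# datetime64 has no struct-style type code, so no format is available.
datetime_format = buffer_format("M8[s]")

print(nested_format)
print(datetime_format)
```

The nested case prints a string containing two `T{` groups with the field names between colons, while the datetime case prints `None`.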

On Wed, Feb 8, 2023 at 5:28 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
Right. I don't think empty field names are useful, but not supporting datetimes is a deal breaker for us.
Note that in the output there `T{i:f0}` denotes that it is structured and the name is "f0". And yes, you can nest another `T{}` inside.
Good to know.
While I agree that the buffer format is a good effort, I consider the NumPy representation considerably more complete (which makes sense, as it had to evolve following users' needs more closely). Still, I *hope* there will be an effort to standardize the NumPy format a bit more formally in the future.
-- Francesc Alted
