>
> Warren
>
>>
>>
>>>
>>>
>>> There is a design change that I think should be made in the
>>> implementation of missing values.
>>>
>>> In the current design described in the NEP, and expanded on in the
>>> comment
>>>
>>>
https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
>>>
>>> the meaning of the values `{len = 0, buf = NULL}` in an instance of
>>> `npy_static_string` depends on whether or not the `na_object` has been
>>> set in the dtype. If it has not been set, that data represents a string
>>> of length 0. If `na_object` *has* been set, that data represents a
>>> missing value. To get a string of length 0 in this case, some non-NULL
>>> value must be assigned to the `buf` field. (In the comment linked
>>> above, @ngoldbaum suggested `{0, "\0"}`, but strings are not
>>> NUL-terminated, so there is no need for that `\0` in `buf`, and in fact,
>>> with `len == 0`, it would be a bug for the pointer to be dereferenced,
>>> so *any* non-NULL value--valid pointer or not--could be used for `buf`.)
>>>
>>> I think it would be better if `len == 0` *always* meant a string with
>>> length 0, with no additional qualifications; it shouldn't be necessary
>>> to put some non-NULL value in `buf` just to get an empty string. We
>>> can achieve this if we use a bit in `len` as a flag for a missing value.
>>> Reserving a bit from `len` as a flag reduces the maximum possible string
>>> length, but as discussed in the NEP pull request, we're almost certainly
>>> going to reserve at least the high bit of `len` when small string
>>> optimization (SSO) is implemented. This will reduce the maximum string
>>> length to `2**(N-1)-1`, where `N` is the bit width of `size_t`
>>> (equivalent to using a signed type for `len`). Even if SSO isn't
>>> implemented immediately, we can anticipate the need for flags stored
>>> in `len`, and use them to implement missing values.
>>>
>>> The actual implementation of SSO will require some more design work,
>>> because the offset of the most significant byte of `len` within the
>>> `npy_static_string` struct depends on the platform endianess. For
>>> little-endian, the most significant byte is not the first byte in the
>>> struct, so the bytes available for SSO within the struct are not
>>> contiguous when the fields have the order `{len, buf}`.
>>>
>>> I experimented with these ideas, and put the result at
>>>
>>>
https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring>>>
>>> The idea that I propose there is to make the memory layout of the
>>> struct depend on the endianess of the platform, so the most
>>> significant byte of `len` (which I called `size`, to avoid any chance
>>> of confusion with the actual length of the string [1]) is at the
>>> beginning of the struct on big-endian platforms and at the end of the
>>> struct for little-endian platforms. More details are included in the
>>> file README.md. Note that I am not suggesting that all the SSO stuff
>>> be included in the current NEP! This is just a proof-of-concept that
>>> shows one possibility for SSO.
>>>
>>> In that design, the high bit of `size` (which is `len` here) being set
>>> indicates that the `npy_static_string` struct should not be interpreted
>>> as the standard `{len, buf}` representation of a string. When the
>>> second highest bit is set, it means we have a missing value. If the
>>> second highest bit is not set, SSO is active; see the link above for
>>> more details.
>>>
>>> With this design, `len == 0` *always* means a string of length 0,
>>> regardless of whether or not `na_object` is defined in the dtype.
>>>
>>> Also with this design, an array created with `calloc()` will
>>> automatically be an array of empty strings. With current design in
>>> the NEP, an array created with `calloc()` will be either an array of
>>> empty strings, or an array of missing values, depending on whether or
>>> not the dtype has `na_object` defined. That conditional behavior
>>> seems less than desirable.
>>>
>>> What do you think?
>>>
>>> --Warren
>>>
>>> [1] I would like to see `len` renamed to `size` in the
>>> `npy_static_string` struct, but that's bikeshed stuff, and not
>>> a blocker.
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list --
numpy-discussion@python.org>>> To unsubscribe send an email to
numpy-discussion-leave@python.org>>>
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>>> Member address:
nathan12343@gmail.com>>
>> _______________________________________________
>> NumPy-Discussion mailing list --
numpy-discussion@python.org>> To unsubscribe send an email to
numpy-discussion-leave@python.org>>
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>> Member address:
warren.weckesser@gmail.com
_______________________________________________