There is a design change that I think should be made in the
implementation of missing values.
In the current design described in the NEP, and expanded on in the
comment
https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
the meaning of the values `{len = 0, buf = NULL}` in an instance of
`npy_static_string` depends on whether or not the `na_object` has been
set in the dtype. If it has not been set, that data represents a string
of length 0. If `na_object` *has* been set, that data represents a
missing value. To get a string of length 0 in this case, some non-NULL
value must be assigned to the `buf` field. (In the comment linked
above, @ngoldbaum suggested `{0, "\0"}`, but strings are not
NUL-terminated, so there is no need for that `\0` in `buf`, and in fact,
with `len == 0`, it would be a bug for the pointer to be dereferenced,
so *any* non-NULL value--valid pointer or not--could be used for `buf`.)
I think it would be better if `len == 0` *always* meant a string with
length 0, with no additional qualifications; it shouldn't be necessary
to put some non-NULL value in `buf` just to get an empty string. We
can achieve this if we use a bit in `len` as a flag for a missing value.
Reserving a bit from `len` as a flag reduces the maximum possible string
length, but as discussed in the NEP pull request, we're almost certainly
going to reserve at least the high bit of `len` when small string
optimization (SSO) is implemented. This will reduce the maximum string
length to `2**(N-1)-1`, where `N` is the bit width of `size_t`
(equivalent to using a signed type for `len`). Even if SSO isn't
implemented immediately, we can anticipate the need for flags stored
in `len`, and use them to implement missing values.
The actual implementation of SSO will require some more design work,
because the offset of the most significant byte of `len` within the
`npy_static_string` struct depends on the platform endianess. For
little-endian, the most significant byte is not the first byte in the
struct, so the bytes available for SSO within the struct are not
contiguous when the fields have the order `{len, buf}`.
I experimented with these ideas, and put the result at
https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstringThe idea that I propose there is to make the memory layout of the
struct depend on the endianess of the platform, so the most
significant byte of `len` (which I called `size`, to avoid any chance
of confusion with the actual length of the string [1]) is at the
beginning of the struct on big-endian platforms and at the end of the
struct for little-endian platforms. More details are included in the
file README.md. Note that I am not suggesting that all the SSO stuff
be included in the current NEP! This is just a proof-of-concept that
shows one possibility for SSO.
In that design, the high bit of `size` (which is `len` here) being set
indicates that the `npy_static_string` struct should not be interpreted
as the standard `{len, buf}` representation of a string. When the
second highest bit is set, it means we have a missing value. If the
second highest bit is not set, SSO is active; see the link above for
more details.
With this design, `len == 0` *always* means a string of length 0,
regardless of whether or not `na_object` is defined in the dtype.
Also with this design, an array created with `calloc()` will
automatically be an array of empty strings. With current design in
the NEP, an array created with `calloc()` will be either an array of
empty strings, or an array of missing values, depending on whether or
not the dtype has `na_object` defined. That conditional behavior
seems less than desirable.
What do you think?
--Warren
[1] I would like to see `len` renamed to `size` in the
`npy_static_string` struct, but that's bikeshed stuff, and not
a blocker.
_______________________________________________