On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com> wrote:
The NEP was merged in draft form, see below.
https://numpy.org/neps/nep-0055-string_dtype.html
On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hello all,
I just opened a pull request to add NEP 55, see
Per NEP 0, I've copied everything up to the "detailed description"
https://github.com/numpy/numpy/pull/24483. section below.
I'm looking forward to your feedback on this.
-Nathan Goldbaum
This will be a nice addition to NumPy, and matches a suggestion by @rkern (and probably others) made in the 2017 mailing list thread; see the last bullet of https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html So +1 for the enhancement! Now for some nitty-gritty review... There is a design change that I think should be made in the implementation of missing values. In the current design described in the NEP, and expanded on in the comment https://github.com/numpy/numpy/pull/24483#discussion_r1311815944, the meaning of the values `{len = 0, buf = NULL}` in an instance of `npy_static_string` depends on whether or not the `na_object` has been set in the dtype. If it has not been set, that data represents a string of length 0. If `na_object` *has* been set, that data represents a missing value. To get a string of length 0 in this case, some non-NULL value must be assigned to the `buf` field. (In the comment linked above, @ngoldbaum suggested `{0, "\0"}`, but strings are not NUL-terminated, so there is no need for that `\0` in `buf`, and in fact, with `len == 0`, it would be a bug for the pointer to be dereferenced, so *any* non-NULL value--valid pointer or not--could be used for `buf`.) I think it would be better if `len == 0` *always* meant a string with length 0, with no additional qualifications; it shouldn't be necessary to put some non-NULL value in `buf` just to get an empty string. We can achieve this if we use a bit in `len` as a flag for a missing value. Reserving a bit from `len` as a flag reduces the maximum possible string length, but as discussed in the NEP pull request, we're almost certainly going to reserve at least the high bit of `len` when small string optimization (SSO) is implemented. This will reduce the maximum string length to `2**(N-1)-1`, where `N` is the bit width of `size_t` (equivalent to using a signed type for `len`). Even if SSO isn't implemented immediately, we can anticipate the need for flags stored in `len`, and use them to implement missing values. The actual implementation of SSO will require some more design work, because the offset of the most significant byte of `len` within the `npy_static_string` struct depends on the platform endianess. For little-endian, the most significant byte is not the first byte in the struct, so the bytes available for SSO within the struct are not contiguous when the fields have the order `{len, buf}`. I experimented with these ideas, and put the result at https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring The idea that I propose there is to make the memory layout of the struct depend on the endianess of the platform, so the most significant byte of `len` (which I called `size`, to avoid any chance of confusion with the actual length of the string [1]) is at the beginning of the struct on big-endian platforms and at the end of the struct for little-endian platforms. More details are included in the file README.md. Note that I am not suggesting that all the SSO stuff be included in the current NEP! This is just a proof-of-concept that shows one possibility for SSO. In that design, the high bit of `size` (which is `len` here) being set indicates that the `npy_static_string` struct should not be interpreted as the standard `{len, buf}` representation of a string. When the second highest bit is set, it means we have a missing value. If the second highest bit is not set, SSO is active; see the link above for more details. With this design, `len == 0` *always* means a string of length 0, regardless of whether or not `na_object` is defined in the dtype. Also with this design, an array created with `calloc()` will automatically be an array of empty strings. With current design in the NEP, an array created with `calloc()` will be either an array of empty strings, or an array of missing values, depending on whether or not the dtype has `na_object` defined. That conditional behavior seems less than desirable. What do you think? --Warren [1] I would like to see `len` renamed to `size` in the `npy_static_string` struct, but that's bikeshed stuff, and not a blocker.