[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Sept. 20, 2023

      On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <warren.weckesser@gmail.com>
wrote:
...
On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <
warren.weckesser@gmail.com> wrote:
...
On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldbaum@gmail.com>
...
...
On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <
warren.weckesser@gmail.com> wrote:
...
...
On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com>
wrote:
...
...
The NEP was merged in draft form, see below.
https://numpy.org/neps/nep-0055-string_dtype.html
On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com>
wrote:
...
...
Hello all,
I just opened a pull request to add NEP 55, see
https://github.com/numpy/numpy/pull/24483.
...
Per NEP 0, I've copied everything up to the "detailed description"
...
...
...
...
...
I'm looking forward to your feedback on this.
-Nathan Goldbaum
This will be a nice addition to NumPy, and matches a suggestion by
@rkern (and probably others) made in the 2017 mailing list thread;
see the last bullet of
https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
...
So +1 for the enhancement!
Now for some nitty-gritty review...
Thanks for the nitty-gritty review! I was on vacation last week and
haven't had a chance to look over this in detail yet, but at first glance
...
...
I'm going to try to integrate your proposed design into the dtype
wrote:
section below.
this seems like a really nice improvement.
prototype this week. If that works, I'd like to include some of the text
from the README in your repo in the NEP and add you as an author, would
that be alright?
...
Sure, that would be fine.
I have a few more comments and questions about the NEP that I'll finish
up and send this weekend.
...
One more comment on the NEP...
My first impression of the missing data API design is that
it is more complicated than necessary. An alternative that
is simpler--and is consistent with the pattern established for
floats and datetimes--is to define a "not a string" value, say
`np.nastring` or something similar, just like we have `nan` for
floats and `nat` for datetimes. Its behavior could be what
you called "nan-like".
Float `np.nan` and datetime missing value sentinel are not all that
similar, and the latter was always a bit questionable (at least partially
it's a left-over of trying to introduce generic missing value support I
believe). `nan` is a float and part of C/C++ standards with well-defined
numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
possible to generate a NaT value with a regular operation on a datetime
array a la `np.array([1.5]) / 0.0`.

The handling of `np.nastring` would be an intrinsic part of the
...
dtype, so there would be no need for the `na_object` parameter
of `StringDType`. All `StringDType`s would handle `np.nastring`
in the same consistent manner.
The use-case for the string sentinel does not seem very
compelling (but maybe I just don't understand the use-cases).
If there is a real need here that is not covered by
`np.nastring`, perhaps just a flag to control the repr of
`np.nastring` for each StringDType instance would be enough?
My understanding is that the NEP provides the necessary but limited support
to allow Pandas to adopt the new dtype. The scope section of the NEP says:
"Fully agreeing on the semantics of a missing data sentinels or adding a
missing data sentinel to NumPy itself.". And then further down:
"By only supporting user-provided missing data sentinels, we avoid
resolving exactly how NumPy itself should support missing data and the
correct semantics of the missing data object, leaving that up to users to
decide"

That general approach I agree with, it's a large can of worms and not the
main purpose of this NEP. Nathan may have more thoughts about what, if
anything, from your suggestions could be adopted, but the general "let's
introduce a missing value thing" is a path we should not go down here imho.
...
If there is an objection to a potential proliferation of
"not a thing" special values, one for each type that can
handle them, then perhaps a generic "not a value" (say
`np.navalue`) could be created that, when assigned to an
element of an array, results in the appropriate "not a thing"
value actually being assigned. In a sense, I guess this NEP is
proposing that, but it is reusing the floating point object
`np.nan` as the generic "not a thing" value
It is explicitly not using `np.nan` but instead allowing the user to
provide their preferred sentinel. You're probably referring to the example
with `na_object=np.nan`, but that example would work with another sentinel
value too.

Cheers,
Ralf
...
, and my preference
is that, *if* we go with such a generic object, it is not
the floating point value `nan` but a new thing with a name
that reflects its purpose. (I guess Pandas users might be
accustomed to `nan` being a generic sentinel for missing data,
so its use doesn't feel as incohesive as it might to others.
Passing a string array to `np.isnan()` just feels *wrong* to
me.)
Any, that's my 2¢.
Warren
...
Warren
...
...
There is a design change that I think should be made in the
implementation of missing values.
In the current design described in the NEP, and expanded on in the
comment
https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
the meaning of the values `{len = 0, buf = NULL}` in an instance of
`npy_static_string` depends on whether or not the `na_object` has been
set in the dtype. If it has not been set, that data represents a string
of length 0. If `na_object` *has* been set, that data represents a
missing value. To get a string of length 0 in this case, some non-NULL
value must be assigned to the `buf` field. (In the comment linked
above, @ngoldbaum suggested `{0, "\0"}`, but strings are not
NUL-terminated, so there is no need for that `\0` in `buf`, and in
fact,
...
...
...
with `len == 0`, it would be a bug for the pointer to be dereferenced,
so *any* non-NULL value--valid pointer or not--could be used for
`buf`.)
I think it would be better if `len == 0` *always* meant a string with
length 0, with no additional qualifications; it shouldn't be necessary
to put some non-NULL value in `buf` just to get an empty string. We
can achieve this if we use a bit in `len` as a flag for a missing
value.
Reserving a bit from `len` as a flag reduces the maximum possible
string
length, but as discussed in the NEP pull request, we're almost
certainly
going to reserve at least the high bit of `len` when small string
optimization (SSO) is implemented. This will reduce the maximum string
length to `2**(N-1)-1`, where `N` is the bit width of `size_t`
(equivalent to using a signed type for `len`). Even if SSO isn't
implemented immediately, we can anticipate the need for flags stored
in `len`, and use them to implement missing values.
The actual implementation of SSO will require some more design work,
because the offset of the most significant byte of `len` within the
`npy_static_string` struct depends on the platform endianess. For
little-endian, the most significant byte is not the first byte in the
struct, so the bytes available for SSO within the struct are not
contiguous when the fields have the order `{len, buf}`.
I experimented with these ideas, and put the result at
https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring
...
The idea that I propose there is to make the memory layout of the
struct depend on the endianess of the platform, so the most
significant byte of `len` (which I called `size`, to avoid any chance
of confusion with the actual length of the string [1]) is at the
beginning of the struct on big-endian platforms and at the end of the
struct for little-endian platforms. More details are included in the
file README.md. Note that I am not suggesting that all the SSO stuff
be included in the current NEP! This is just a proof-of-concept that
shows one possibility for SSO.
In that design, the high bit of `size` (which is `len` here) being set
indicates that the `npy_static_string` struct should not be interpreted
as the standard `{len, buf}` representation of a string. When the
second highest bit is set, it means we have a missing value. If the
second highest bit is not set, SSO is active; see the link above for
more details.
With this design, `len == 0` *always* means a string of length 0,
regardless of whether or not `na_object` is defined in the dtype.
Also with this design, an array created with `calloc()` will
automatically be an array of empty strings. With current design in
the NEP, an array created with `calloc()` will be either an array of
empty strings, or an array of missing values, depending on whether or
not the dtype has `na_object` defined. That conditional behavior
seems less than desirable.
What do you think?
--Warren
[1] I would like to see `len` renamed to `size` in the
`npy_static_string` struct, but that's bikeshed stuff, and not
a blocker.
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: nathan12343@gmail.com
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: warren.weckesser@gmail.com

NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ralf.gommers@googlemail.com

[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Ralf Gommers