[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Feb. 12, 2024

Hi,

I know that I'm a little late to be asking about this, but I don't see a
comment elsewhere on it (in the NEP, the implementation PR #25347, or this
email thread).

As I understand it, the new StringDType implementation distinguishes 3
types of individual strings, any of which can be present in an array:

   1. short strings, included inline in the array (at most 15 bytes on a
   64-bit system)
   2. arena-allocated strings, which are managed by the npy_string_allocator
   3. heap-allocated strings, which are pointers anywhere in RAM.
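
To make case 1 concrete, here is a minimal sketch of the size check, assuming the 15-byte inline limit quoted above for a 64-bit system. The name `fits_inline` and the constant are hypothetical and exist only for illustration; they are not NumPy API. The point is that the limit counts UTF-8 bytes, not characters.

```python
# Hypothetical illustration, not NumPy's actual implementation.
INLINE_LIMIT_BYTES = 15  # limit quoted above for a 64-bit system


def fits_inline(s: str) -> bool:
    """Return True if the UTF-8 encoding of s is at most 15 bytes."""
    return len(s.encode("utf-8")) <= INLINE_LIMIT_BYTES


for s in ["numpy", "a" * 15, "a" * 16, "é" * 8]:
    # "é" takes two bytes in UTF-8, so eight of them already exceed the limit.
    print(repr(s), len(s.encode("utf-8")), fits_inline(s))
```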

Does case 3 include strings that are passed to the array as views, without
copying? If so, then the ownership of strings would either need to be
tracked on a per-string basis (distinct from the array_owned boolean, which
characterizes the whole array), or they need to all be considered stolen
references (NumPy will free all of them when the array goes out of scope),
or they all need to be considered borrowed references (NumPy will not free
any of them when the array goes out of scope).
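
The three ownership options above can be sketched as a toy model. Everything here (`Ownership`, `HeapString`, `ToyArray`, `release`) is a hypothetical name invented for illustration and says nothing about how StringDType actually behaves; the sketch only shows which strings each policy would free when the array goes away.

```python
from dataclasses import dataclass, field
from enum import Enum


class Ownership(Enum):
    PER_STRING = "per-string flag"
    STOLEN = "stolen references"
    BORROWED = "borrowed references"


@dataclass
class HeapString:
    data: bytes
    owned: bool = False  # only consulted under the PER_STRING policy


@dataclass
class ToyArray:
    policy: Ownership
    strings: list = field(default_factory=list)

    def release(self):
        """Return the strings this array would free when it is destroyed."""
        if self.policy is Ownership.STOLEN:
            return list(self.strings)  # the array frees everything
        if self.policy is Ownership.BORROWED:
            return []  # the array frees nothing
        return [s for s in self.strings if s.owned]  # tracked per string


strings = [
    HeapString(b"copied by the array", owned=True),
    HeapString(b"external view", owned=False),
]
for policy in Ownership:
    freed = ToyArray(policy, strings).release()
    print(policy.value, "->", len(freed), "freed")
```

With a mix of copied and viewed strings, only the per-string policy frees exactly the strings the array allocated itself.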

If the array does not accept new strings as views, but always copies any
externally provided string, then why distinguish between cases 2 and 3? How
would an array end up with some strings being arena-allocated and other
strings being heap-allocated?

Thanks!
-- Jim

On Wed, Sep 20, 2023 at 10:25 AM Nathan <nathan.goldbaum@gmail.com> wrote:
...
On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard <kevin.k.sheppard@gmail.com>
wrote:
...
On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gommers@gmail.com>
wrote:
...
On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <
warren.weckesser@gmail.com> wrote:
...
On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <
warren.weckesser@gmail.com> wrote:
...
On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldbaum@gmail.com>
...
...
On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <
warren.weckesser@gmail.com> wrote:
...
>
>
>
> On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com> wrote:
> >
> > The NEP was merged in draft form, see below.
> >
> > https://numpy.org/neps/nep-0055-string_dtype.html
> >
> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
> >>
> >> Hello all,
> >>
> >> I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483.
> >>
> >> Per NEP 0, I've copied everything up to the "detailed description" section below.
> >>
> >> I'm looking forward to your feedback on this.
> >>
> >> -Nathan Goldbaum
> >>
>
> This will be a nice addition to NumPy, and matches a suggestion by
> @rkern (and probably others) made in the 2017 mailing list thread;
> see the last bullet of
>
> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>
> So +1 for the enhancement!
>
> Now for some nitty-gritty review...
Thanks for the nitty-gritty review! I was on vacation last week and
haven't had a chance to look over this in detail yet, but at first glance
this seems like a really nice improvement.
...
I'm going to try to integrate your proposed design into the dtype
prototype this week. If that works, I'd like to include some of the text
from the README in your repo in the NEP and add you as an author. Would
that be alright?
...
Sure, that would be fine.
I have a few more comments and questions about the NEP that I'll
finish up and send this weekend.
...
One more comment on the NEP...
My first impression of the missing data API design is that
it is more complicated than necessary. An alternative that
is simpler--and is consistent with the pattern established for
floats and datetimes--is to define a "not a string" value, say
`np.nastring` or something similar, just like we have `nan` for
floats and `nat` for datetimes. Its behavior could be what
you called "nan-like".
Float `np.nan` and datetime missing value sentinel are not all that
similar, and the latter was always a bit questionable (at least partially
it's a left-over of trying to introduce generic missing value support I
believe). `nan` is a float and part of C/C++ standards with well-defined
numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
possible to generate a NaT value with a regular operation on a datetime
array a la `np.array([0.0]) / 0.0`.
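
As a small illustration of the NaT points above, the following sketch uses only `np.datetime64`, `np.timedelta64` and `np.isnat`. It shows that there is no module-level `np.nat` and that an existing NaT propagates through arithmetic. It does not settle whether a regular operation on non-NaT inputs can produce NaT.

```python
import numpy as np

# Unlike np.nan, there is no module-level NaT constant.
print(hasattr(np, "nat"))

# A NaT has to be constructed explicitly, here from the string "NaT".
dates = np.array(["2024-02-12", "NaT"], dtype="datetime64[D]")

# Arithmetic keeps the missing entry missing.
shifted = dates + np.timedelta64(1, "D")
print(np.isnat(shifted))
```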
...
The handling of `np.nastring` would be an intrinsic part of the
dtype, so there would be no need for the `na_object` parameter
of `StringDType`. All `StringDType`s would handle `np.nastring`
in the same consistent manner.
The use-case for the string sentinel does not seem very
compelling (but maybe I just don't understand the use-cases).
If there is a real need here that is not covered by
`np.nastring`, perhaps just a flag to control the repr of
`np.nastring` for each StringDType instance would be enough?
My understanding is that the NEP provides the necessary but limited
support to allow Pandas to adopt the new dtype. The scope section of the
NEP lists as out of scope: "Fully agreeing on the semantics of a missing data sentinels or
adding a missing data sentinel to NumPy itself." And then further down:
"By only supporting user-provided missing data sentinels, we avoid
resolving exactly how NumPy itself should support missing data and the
correct semantics of the missing data object, leaving that up to users to
decide"
That general approach I agree with, it's a large can of worms and not
the main purpose of this NEP. Nathan may have more thoughts about what, if
anything, from your suggestions could be adopted, but the general "let's
introduce a missing value thing" is a path we should not go down here imho.
...
If there is an objection to a potential proliferation of
"not a thing" special values, one for each type that can
handle them, then perhaps a generic "not a value" (say
`np.navalue`) could be created that, when assigned to an
element of an array, results in the appropriate "not a thing"
value actually being assigned. In a sense, I guess this NEP is
proposing that, but it is reusing the floating point object
`np.nan` as the generic "not a thing" value
It is explicitly not using `np.nan` but instead allowing the user to
provide their preferred sentinel. You're probably referring to the example
with `na_object=np.nan`, but that example would work with another sentinel
value too.
Cheers,
Ralf
...
, and my preference
is that, *if* we go with such a generic object, it is not
the floating point value `nan` but a new thing with a name
that reflects its purpose. (I guess Pandas users might be
accustomed to `nan` being a generic sentinel for missing data,
so its use doesn't feel as incohesive as it might to others.
Passing a string array to `np.isnan()` just feels *wrong* to
me.)
Anyway, that's my 2¢.
Warren
I was a bit surprised that len was not used as part of the missing
value.  The NEP proposal that 0 is an empty string unless there is a
sentinel, in which case it is a missing value, feels pretty limiting, since
these are distinctly different things.
Would it make sense for len<0 to indicate a missing value?  This would
require using ssize_t instead of size_t, and would then limit the string
size. In principle this would allow roughly half of the values
representable in ssize_t to serve as distinct missing values.  I think
ssize_t is well-defined on all platforms targeted by NumPy.
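
A minimal sketch of this signed-length idea, assuming a 64-bit `ssize_t` (modelled here with `struct` format `q`). The helper names and the `missing_code` parameter are hypothetical. The sketch shows that an empty string (length 0) and a missing value (negative length) stay distinct, at the cost of halving the maximum length.

```python
import struct


def encode_length(length=None, missing_code=1):
    """Pack a length, or a negative missing-value code, into 8 signed bytes."""
    value = -missing_code if length is None else length
    return struct.pack("<q", value)


def decode_length(raw):
    """Return ("length", n) for real strings or ("missing", code) otherwise."""
    (value,) = struct.unpack("<q", raw)
    if value < 0:
        return ("missing", -value)
    return ("length", value)


print(decode_length(encode_length(0)))        # empty string
print(decode_length(encode_length(None)))     # missing value, code 1
print(decode_length(encode_length(None, 3)))  # a different missing code
print(decode_length(encode_length(2**63 - 1)))  # largest representable length
```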
Kevin
Hey Kevin,
Thanks for the comment. Right now the NEP text is a little out of
date compared to the implementation. I've since rewritten it to use
Warren's proposal more or less verbatim, so now the missing value flag is
stored in a bit of the size field.
See https://github.com/numpy/numpy-user-dtypes/pull/86 for the
implementation, which also includes a small string optimization
implementation.
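
For readers unfamiliar with the technique, storing a flag in a bit of the size field can be sketched as follows. The choice of the top bit of a 64-bit field, and the names `MISSING_FLAG`, `pack` and `unpack`, are assumptions made for illustration; the linked pull request is the reference for the actual layout.

```python
SIZE_BITS = 64
MISSING_FLAG = 1 << (SIZE_BITS - 1)  # assumed position: the top bit
SIZE_MASK = MISSING_FLAG - 1         # remaining 63 bits hold the size


def pack(size, missing=False):
    """Combine a size and a missing flag into one unsigned integer field."""
    if not 0 <= size <= SIZE_MASK:
        raise ValueError("size does not fit in the remaining bits")
    return size | (MISSING_FLAG if missing else 0)


def unpack(size_field):
    """Split the field back into (size, missing)."""
    return size_field & SIZE_MASK, bool(size_field & MISSING_FLAG)


print(unpack(pack(0)))                # empty string
print(unpack(pack(0, missing=True)))  # missing value, distinct from empty
print(unpack(pack(5)))                # ordinary 5-byte string
```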
...
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: nathan12343@gmail.com


Jim Pivarski