Proposal to accept NEP 55: Add a UTF-8 variable-width string DType to NumPy - NumPy-Discussion

23 Jan 2024

Hi all,

I propose we accept NEP 55 and merge PR #25347 implementing the NEP in time
for the NumPy 2.0 RC:

https://numpy.org/neps/nep-0055-string_dtype.html
https://github.com/numpy/numpy/pull/25347

The most controversial aspect of the NEP was support for missing strings
via a user-supplied sentinel object. In the previous discussion on the
mailing list, Warren Weckesser argued for shipping a missing data sentinel
with NumPy for use with the DType, while in code review and the PR for the
NEP, Sebastian expressed concern about the additional complexity of
including missing data support at all.

I found that supporting missing data is key to efficiently supporting the
new DType in Pandas, which suggests that we need some level of missing
data support to fully replace object string arrays. I believe the
compromise proposal in the NEP is sufficient for downstream libraries while
limiting additional complexity elsewhere in NumPy.

Concerns raised in previous discussions about concretely specifying the C
API to be made public, preventing use-after-free errors in a multithreaded
context, and uncertainty around the arena allocator implementation have
been resolved in the latest version of the NEP and the open PR.
Additionally, due to some excellent and timely work by Lysandros Nikolaou,
we now have a number of string ufuncs in NumPy and a straightforward plan
to add more. Loops have been implemented for all the ufuncs added in the
NumPy 2.0 dev cycle so far.

I would like to see us ship the DType in NumPy 2.0. This will allow us to
advertise a major new feature, will spur efforts to support new DTypes in
downstream libraries, and will allow us to get feedback from the community
that would be difficult to obtain without releasing the code into the wild.
Additionally, I am funded via a NASA ROSES grant for work related to this
effort until the end of 2024, so including the DType in NumPy 2.0 will more
efficiently use my funded time to fix issues.

If there are no substantive objections to this email, then the NEP will be
considered accepted; see NEP 0 for more details:
https://numpy.org/neps/nep-0000.html

Nathan