Proposal to accept NEP 55: Add a UTF-8 variable-width string DType to NumPy
Hi all,
I propose we accept NEP 55 and merge PR #25347 implementing the NEP in time for the NumPy 2.0 RC:
https://numpy.org/neps/nep-0055-string_dtype.html
https://github.com/numpy/numpy/pull/25347
The most controversial aspect of the NEP was support for missing strings via a user-supplied sentinel object. In the previous discussion on the mailing list, Warren Weckesser argued for shipping a missing data sentinel with NumPy for use with the DType, while in code review and the PR for the NEP, Sebastian expressed concern about the additional complexity of including missing data support at all.
I found that supporting missing data is key to efficiently supporting the new DType in Pandas. I think that argues that we need some level of missing data support to fully replace object string arrays. I believe the compromise proposal in the NEP is sufficient for downstream libraries while limiting additional complexity elsewhere in NumPy.
Concerns raised in previous discussions about concretely specifying the C API to be made public, preventing use-after-free errors in a multithreaded context, and uncertainty around the arena allocator implementation have been resolved in the latest version of the NEP and the open PR. Additionally, due to some excellent and timely work by Lysandros Nikolaou, we now have a number of string ufuncs in NumPy and a straightforward plan to add more. Loops have been implemented for all the ufuncs added in the NumPy 2.0 dev cycle so far.
I would like to see us ship the DType in NumPy 2.0. This will allow us to advertise a major new feature, will spur efforts to support new DTypes in downstream libraries, and will allow us to get feedback from the community that would be difficult to obtain without releasing the code into the wild. Additionally, I am funded via a NASA ROSES grant for work related to this effort until the end of 2024, so including the DType in NumPy 2.0 will more efficiently use my funded time to fix issues.
If there are no substantive objections to this email, then the NEP will be considered accepted; see NEP 0 for more details: https://numpy.org/neps/nep-0000.html
On Mon, Jan 22, 2024 at 5:14 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hi all,
I propose we accept NEP 55 and merge PR #25347 implementing the NEP in time for the NumPy 2.0 RC:
Don't worry too much about the timing; we aren't going to branch without the new strings unless the cat gets into them, which is unlikely.
Chuck
On Mon, 2024-01-22 at 17:08 -0700, Nathan wrote:
Hi all,
I propose we accept NEP 55 and merge PR #25347 implementing the NEP in time for the NumPy 2.0 RC:
I really like this work and I think it is a big improvement! At this point we probably have to expect some things to be still buggy, but that is also a reason to get it in (testing is hard if it isn't shipped first-class unfortunately).
Nathan summarized the things I might have brought up very well. The support of missing values is the one thing that to me may end up a bit more in flux. But I am happy to hope that this is in a way that pandas will not be affected and, honestly, without deep integration testing we won't make progress in figuring out whether there is some change needed or not.
Thanks for the great work!
- Sebastian
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: sebastian@sipsolutions.net
On Wed, Jan 24, 2024 at 10:43 AM Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Mon, 2024-01-22 at 17:08 -0700, Nathan wrote:
Hi all,
I propose we accept NEP 55 and merge PR #25347 implementing the NEP in time for the NumPy 2.0 RC:
I really like this work and I think it is a big improvement! At this point we probably have to expect some things to be still buggy, but that is also a reason to get it in (testing is hard if it isn't shipped first-class unfortunately).
+1 to this. It's seen a ton of hard and careful work for about a year now, and seems close to as ready as it's going to get pre-merging. So +1 to accepting the NEP now and hitting the green button on your main PR.
Cheers,
Ralf
participants (4)
- Charles R Harris
- Nathan
- Ralf Gommers
- Sebastian Berg