NEP 55 Updates and call for testing
Hi all,

This week I updated NEP 55 to reflect the changes I made to the prototype since I initially sent out the NEP. The updated NEP is available on the NumPy website: https://numpy.org/neps/nep-0055-string_dtype.html.

Updates to the NEP
++++++++++++++++++

The changes since the original version of the NEP focus on fully defining the C API surface we would like to add to the NumPy C API and on an implementation of a per-dtype-instance arena allocator to manage heap allocations. This enabled major improvements to the prototype, including implementing the small string optimization and locking all access to heap memory behind a fine-grained mutex, which should prevent segfaults or memory corruption in a multithreaded context. Thanks to Warren Weckesser for his proof-of-concept code and help with the small string optimization implementation; he has been added as an author to reflect his contributions.

With these changes the stringdtype prototype is feature complete.

Call to Review NEP 55
+++++++++++++++++++++

I'm requesting another round of review on the NEP with an eye toward acceptance before the NumPy 2.0 release branch is created from main. If I can manage it, my plan is to have a pull request open that merges the stringdtype codebase into NumPy before the branch is created. That said, if we decide that we need more time, or if some issue comes up, I'm happy with this going into main after the NumPy 2.0 release branch is created.

The most significant feedback we have not addressed from the last round of review was Warren's suggestion to add a default missing data sentinel to NumPy itself. For reasons outlined in the NEP and in my reply to Warren from earlier this year, we do not want to add a missing data singleton to NumPy, instead leaving it to users to choose the missing data semantics they prefer. Otherwise I believe the current draft addresses all outstanding feedback from the last round of review.

Help me Test the Prototype!
+++++++++++++++++++++++++++

If anyone has time and interest, I would also very much appreciate some testing and tire-kicking on the stringdtype prototype, available at https://github.com/numpy/numpy-user-dtypes.

There is a README with build instructions here: https://github.com/numpy/numpy-user-dtypes/blob/main/stringdtype/README.md

If you have a Python development environment with a C compiler, it should be straightforward to build, install, and test the prototype. Note that you must have `NUMPY_EXPERIMENTAL_DTYPE_API=1` set in your shell environment or via `os.environ` to import stringdtype without error.

I'm particularly interested to hear experiences converting code to use stringdtype. This could be code using fixed-width strings in a situation where a variable-length string array makes more sense, or code using object string arrays. Are there pain points that aren't discussed in the NEP, or existing workflows that cannot be adapted to use stringdtype? As far as I'm aware there aren't, but more testing will help catch issues before we've stabilized everything.

My fork of pandas might be a source of inspiration for porting an existing non-trivial codebase that used object string arrays:

https://github.com/pandas-dev/pandas/compare/main...ngoldbaum:pandas:stringd...

Thanks all for your time, attention, and help reviewing the NEP!

-Nathan
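For anyone trying the `os.environ` route mentioned in the message, a minimal sketch might look like the following. It assumes the prototype is installed as a package named `stringdtype` that exports `StringDType`; that import line is an assumption based on the prototype's name and is not confirmed anywhere in this thread. The one thing the message does state is that the variable has to be in place for the import to succeed, so the sketch sets it first.

```python
import os

# Set the flag before the import below runs; setting it afterwards is too late.
os.environ["NUMPY_EXPERIMENTAL_DTYPE_API"] = "1"

try:
    # Assumption: the prototype package exposes the dtype class under this name.
    from stringdtype import StringDType
except ImportError:
    # Prototype (or NumPy) not installed in this environment.
    StringDType = None

print("stringdtype available:", StringDType is not None)
```

Exporting the variable in the shell before starting Python has the same effect and avoids any dependence on import order inside the script.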
I just opened a draft PR to include stringdtype in numpy: https://github.com/numpy/numpy/pull/25347

If you are interested in testing the new dtype but haven't had the chance yet, hopefully this should be easier to test. From a clone of the NumPy repo, doing:

    $ git fetch https://github.com/ngoldbaum/numpy stringdtype:stringdtype
    $ git checkout stringdtype
    $ git submodule update --init
    $ python -m pip install .

should build and install a version of NumPy that includes stringdtype, importable as `np.dtypes.StringDType`. Note that this is based on numpy 2.0 dev, so if you need to use another package that depends on NumPy's ABI to test the dtype, you'll need to rebuild that project as well.

I'll be continuing to work on this PR to finish integrating stringdtype into NumPy and write documentation.

If anyone has any feedback on any aspect of the NEP or the stringdtype code please reply here, on github, or reach out to me privately.

On Wed, Nov 22, 2023 at 1:22 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hi Nathan,

thank you for your great work on UTF-8 strings and their integration in NumPy. This is a very important dtype to support, especially with the widespread use of large language models (LLMs) nowadays.

However, I would like to comment on the serialization. I hope it's not too late at this point (the last time I looked, I think the serialization was less detailed), but the approach with sidecar_size has the disadvantage that NumPy arrays would no longer be efficiently appendable. Also, while it's comfortable not to need to specify what the sidecar data looks like, it makes the format more proprietary, less open and more difficult to debug. To be precise, I see this part of NEP 1 violated:

    Be reverse engineered. Datasets often live longer than the programs that created them. A competent developer should be able to create a solution in his preferred programming language to read most NPY files that he has been given without much documentation.

On top of that, what if the data format for the sidecar data changes? Is it then still possible to read old files with a newer NumPy version?

To overcome these issues, first and foremost I suggest not introducing a .npy file format version 4.0 with this kind of fundamental change. Instead of adding the sidecar data to the same .npy file, I suggest adding it to an extra, standard .npy file. Here is an example: if the user wants to save to "mystrings.npy", the files "mystrings.npy" and "mystrings.npy.idx" could be created, where the latter is also just a regular .npy file and contains the indices/offsets into mystrings.npy that would otherwise end up in the sidecar data. When the file "mystrings.npy" is loaded, NumPy would check whether there is a mystrings.npy.idx and try to load it as well.

This is an approach I use pretty regularly with video data: the single frames are all in one big (appendable) array and the index array contains the begin indices/offsets and (redundantly, for fast lookups, sorting etc.) the lengths and end offsets. This concept is very generic, can be used for all sorts of ragged arrays including text, satisfies the requirements of NEP 1 and is efficiently appendable (or at least moves the burden from the programmer to the filesystem).

Best, Michael

PS Also, a dedicated index array can come with a custom shape, which would allow for multidimensional ragged arrays (in the future). Wouldn't that be great?

PPS This is a copy of what I have written in the pull request. As Nathan said, the serialization part is not defined and/or implemented yet and I'd like to hear some more opinions about this.
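The data-plus-index layout described in this message can be illustrated with a small, dependency-free sketch. Everything in it is hypothetical: the function names are invented for illustration, a `bytearray` and an `array("q")` stand in for the contents of "mystrings.npy" and "mystrings.npy.idx", and the index uses the compact convention of n + 1 end offsets where the message describes storing begin offsets, lengths and end offsets redundantly. It is not the serialization NEP 55 specifies.

```python
from array import array


def pack_strings(strings):
    """Pack strings into one UTF-8 byte buffer plus an offsets index.

    offsets has len(strings) + 1 entries; string i occupies
    data[offsets[i]:offsets[i + 1]].
    """
    data = bytearray()
    offsets = array("q", [0])  # signed 64-bit byte offsets
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return data, offsets


def get_string(data, offsets, i):
    """Decode the i-th string using only the index and the byte buffer."""
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")


def append_string(data, offsets, s):
    """Append one string: both buffers grow at the end, nothing is rewritten."""
    data.extend(s.encode("utf-8"))
    offsets.append(len(data))


data, offsets = pack_strings(["numpy", "", "héllo"])
append_string(data, offsets, "x")
print(list(offsets))
print([get_string(data, offsets, i) for i in range(len(offsets) - 1)])
```

Because appending only ever adds bytes at the end of each buffer, the same operation on two files amounts to two file appends, which is the appendability property the message argues for.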
On 8. Dec 2023, at 19:35, Nathan <nathan.goldbaum@gmail.com> wrote: