NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
The NEP was merged in draft form, see below. https://numpy.org/neps/nep-0055-string_dtype.html

On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hello all,
I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483.
Per NEP 0, I've copied everything up to the "detailed description" section below.
I'm looking forward to your feedback on this.
-Nathan Goldbaum
=========================================================
NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
=========================================================

:Author: Nathan Goldbaum <ngoldbaum@quansight.com>
:Status: Draft
:Type: Standards Track
:Created: 2023-06-29

Abstract
--------
We propose adding a new string data type to NumPy where each item in the array is an arbitrary length UTF-8 encoded string. This will enable performance, memory usage, and usability improvements for NumPy users, including:
* Memory savings for workflows that currently use fixed-width strings and store primarily ASCII data or a mix of short and long strings in a single NumPy array.
* Downstream libraries and users will be able to move away from object arrays currently used as a substitute for variable-length string arrays, unlocking performance improvements by avoiding passes over the data outside of NumPy.
* A more intuitive user-facing API for working with arrays of Python strings, without a need to think about the in-memory array representation.
Motivation and Scope
--------------------
First, we will describe how the current state of support for string or string-like data in NumPy arose. Next, we will summarize the last major previous discussion about this topic. Finally, we will describe the scope of the proposed changes to NumPy as well as changes that are explicitly out of scope of this proposal.
History of String Support in NumPy
**********************************
Support in NumPy for textual data evolved organically in response to early user needs and then changes in the Python ecosystem.
Support for strings was added to NumPy to support users of the NumArray ``chararray`` type. Remnants of this are still visible in the NumPy API: string-related functionality lives in ``np.char``, which exists to support the obsolete ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string DTypes.
NumPy's ``bytes_`` DType was originally used to represent the Python 2 ``str`` type before Python 3 support was added to NumPy. The bytes DType makes the most sense when it is used to represent Python 2 strings or other null-terminated byte sequences. However, because data after the first null character are ignored, the ``bytes_`` DType is only suitable for bytestreams that do not contain nulls, so it is a poor match for generic bytestreams.
The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which makes for a straightforward implementation, but is inefficient for storing text that can be represented well using a one-byte ASCII or Latin-1 encoding. This was not a problem in Python 2, where ASCII or mostly-ASCII text could use the Python 2 ``str`` DType (the current ``bytes_`` DType).
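The fourfold overhead for ASCII data is easy to see by inspecting the itemsize of a fixed-width unicode array; a minimal sketch using current NumPy behavior:

```python
import numpy as np

# A fixed-width unicode array stores every character as a 4-byte UCS-4
# code point, even for pure-ASCII text.
arr = np.array(["hello"])
print(arr.dtype.kind)      # 'U' (fixed-width unicode)
print(arr.dtype.itemsize)  # 20 bytes: 5 characters * 4 bytes each

# The same ASCII data stored as bytes needs only one byte per character.
barr = np.array([b"hello"])
print(barr.dtype.itemsize)  # 5
```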
With the arrival of Python 3 support in NumPy, the string DTypes were largely left alone due to backward compatibility concerns, although the unicode DType became the default DType for ``str`` data and the old ``string`` DType was renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal situation of shipping a data type originally intended for null-terminated bytestrings as the data type for *all* Python ``bytes`` data, and a default string type with an in-memory representation that consumes four times as much memory as needed for ASCII or mostly-ASCII data.
Problems with Fixed-Width Strings
*********************************
Both existing string DTypes represent fixed-width sequences, allowing storage of the string data in the array buffer. This avoids adding out-of-band storage to NumPy; however, it makes for an awkward user interface. In particular, the maximum string size must be inferred by NumPy or estimated by the user before loading the data into a NumPy array or selecting an output DType for string operations. In the worst case, this requires an expensive pass over the full dataset to calculate the maximum length of an array element. It also wastes memory when array elements have varying lengths. Pathological cases where an array stores many short strings and a few very long strings waste a particularly large amount of memory.
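The width inference and silent truncation described above can be demonstrated with current fixed-width unicode arrays:

```python
import numpy as np

# NumPy sizes the dtype from the longest string present at creation time,
# which in general requires a pass over all of the input data.
arr = np.array(["short", "a much longer string"])
print(arr.dtype)  # '<U20': every element reserves 20 code points

# Assigning a longer string afterwards silently truncates it to 20
# characters rather than raising an error.
arr[0] = "this string is much longer than twenty characters"
print(len(arr[0]))  # 20
```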
Downstream usage of string data in NumPy arrays has demonstrated the need for a variable-width string data type. In practice, most downstream users employ ``object`` arrays for this purpose. In particular, ``pandas`` has explicitly deprecated support for NumPy fixed-width strings, coerces NumPy fixed-width string arrays to ``object`` arrays, and in the future may switch to only supporting string data via ``PyArrow``, which has native support for UTF-8 encoded variable-width string arrays [1]_. This is unfortunate, since ``object`` arrays have no type guarantees, necessitating expensive sanitization passes, and operations using object arrays cannot release the GIL.
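The lack of type guarantees is easy to trigger; a minimal sketch:

```python
import numpy as np

# An object array accepts any Python object, so nothing guarantees that
# every element is a string without an explicit validation pass.
arr = np.array(["a", "b", "c"], dtype=object)
arr[1] = 3.14  # silently accepted: the array is no longer all-string
print([type(x).__name__ for x in arr])  # ['str', 'float', 'str']
```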
Previous Discussions
--------------------
The project last discussed this topic in depth in 2017, when Julian Taylor proposed a fixed-width text data type parameterized by an encoding [2]_. This started a wide-ranging discussion about pain points for working with string data in NumPy and possible ways forward.
In the end, the discussion identified two use-cases that the current support for strings does a poor job of handling:
* Loading or memory-mapping scientific datasets with unknown encoding,
* Working with string data in a manner that allows transparent conversion between NumPy arrays and Python strings, including support for missing strings.
As a result of this discussion, improving support for string data was added to the NumPy project roadmap [3]_, with an explicit call-out to add a DType better suited to memory-mapping bytes with any or no encoding, and a variable-width string DType that supports missing data to replace usages of object string arrays.
Proposed work
-------------
This NEP proposes adding ``StringDType``, a DType that stores variable-width heap-allocated strings in NumPy arrays, to replace downstream usages of the ``object`` DType for string data. This work will heavily leverage recent improvements in NumPy to improve support for user-defined DTypes, so we will also necessarily be working on the data type internals in NumPy. In particular, we propose to:
* Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
* Work out issues related to adding a DType implemented using the experimental DType API to NumPy itself.
* Add support for a user-provided missing data sentinel.
* Clean up ``np.char``, moving the ufunc-like functions to a new namespace for functions and types related to string support.
* Update the ``npy`` and ``npz`` file formats to allow storage of arbitrary-length sidecar data.
The following is out of scope for this work:
* Changing DType inference for string data.
* Adding a DType for memory-mapping text in unknown encodings or a DType that attempts to fix issues with the ``bytes_`` DType.
* Fully agreeing on the semantics of missing data sentinels or adding a missing data sentinel to NumPy itself.
* Implementing fast ufuncs or SIMD optimizations for string operations.
While we're explicitly ruling out implementing these items as part of this work, adding a new string DType helps set up future work that does implement some of these items.
If implemented, this NEP will make it easier to add a new fixed-width text DType in the future by moving string operations into a long-term supported namespace. We are also proposing a memory layout that should be amenable to writing fast ufuncs and SIMD optimization in some cases, increasing the payoff for writing string operations as SIMD-optimized ufuncs in the future.
While we are not proposing adding a missing data sentinel to NumPy, we are proposing adding support for an optional, user-provided missing data sentinel, so this does move NumPy a little closer to officially supporting missing data. We are attempting to avoid resolving the disagreement described in :ref:`NEP 26<NEP26>` and this proposal does not require or preclude adding a missing data sentinel or bitflag-based missing data support in the future.
Usage and Impact
----------------
The DType is intended as a drop-in replacement for object string arrays. This means that we intend to support as many downstream usages of object string arrays as possible, including all supported NumPy functionality. Pandas is the obvious first user, and substantial work has already occurred to add support in a fork of Pandas. ``scikit-learn`` also uses object string arrays and will be able to migrate to a DType that guarantees the arrays contain only strings. Both h5py [4]_ and PyTables [5]_ will be able to add first-class support for variable-width UTF-8 encoded string datasets in HDF5. String data are heavily used in machine-learning workflows and downstream machine learning libraries will be able to leverage this new DType.
Users who wish to load string data into NumPy and leverage NumPy features like advanced indexing will have a natural choice that offers substantial memory savings over fixed-width unicode strings, along with better validation guarantees and overall integration with NumPy than object string arrays. Moving to a first-class string DType also removes the need to acquire the GIL during string operations, unlocking future optimizations that are impossible with object string arrays.
Performance
***********
Here we briefly describe preliminary performance measurements of the prototype version of ``StringDType`` we have implemented outside of NumPy using the experimental DType API. All benchmarks in this section were performed on a Dell XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy, Pandas, and the ``StringDType`` prototype were all compiled with meson release builds.
Currently, the ``StringDType`` prototype has comparable performance with object arrays and fixed-width string arrays. One exception is array creation from Python strings, where performance is somewhat slower than for object arrays and comparable to fixed-width unicode arrays::
  In [1]: from stringdtype import StringDType

  In [2]: import numpy as np

  In [3]: data = [str(i) * 10 for i in range(100_000)]

  In [4]: %timeit arr_object = np.array(data, dtype=object)
  3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
  12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
  11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In this example, the object DType is substantially faster because the objects in the ``data`` list can be directly interned in the array, while the fixed-width ``str`` DType and ``StringDType`` need to copy the string data, and ``StringDType`` additionally needs to convert the data to UTF-8 and perform heap allocations outside the array buffer. In the future, if Python moves to a UTF-8 internal representation for strings, the string loading performance of ``StringDType`` should improve.
String operations have similar performance::
  In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
  30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

  In [8]: %timeit np.char.capitalize(arr_stringdtype)
  38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

  In [9]: %timeit np.char.capitalize(arr_strdtype)
  46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The poor performance here is a reflection of the slow iterator-based implementation of operations in ``np.char``. If we were to rewrite these operations as ufuncs, we could unlock substantial performance improvements. Using the example of the ``add`` ufunc, which we have implemented for the ``StringDType`` prototype::
  In [10]: %timeit arr_object + arr_object
  10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [11]: %timeit arr_stringdtype + arr_stringdtype
  5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
  65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As described below, we have already updated a fork of Pandas to use a prototype version of ``StringDType``. This demonstrates the performance improvements available when data are already loaded into a NumPy array and are passed to a third-party library. Currently Pandas attempts to coerce all ``str`` data to ``object`` DType by default, and has to check and sanitize existing ``object`` arrays that are passed in. This requires a copy of or a pass over the data, which first-class support for variable-width strings in both NumPy and Pandas makes unnecessary::
  In [13]: import pandas as pd

  In [14]: %timeit pd.Series(arr_stringdtype)
  20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

  In [15]: %timeit pd.Series(arr_object)
  1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
We have also implemented a Pandas extension DType that uses ``StringDType`` under the hood, which is also substantially faster for creating Pandas data structures than the existing Pandas string DType that uses ``object`` arrays::
  In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
  54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

  In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
  1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Backward compatibility
----------------------
We are not proposing a change to DType inference for Python strings and do not expect to see any impacts on existing usages of NumPy, besides warnings or errors related to new deprecations or expiring deprecations in ``np.char``.
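In other words, existing inference behavior is preserved: creating an array from Python strings without an explicit dtype continues to produce a fixed-width unicode array, and the proposed DType must always be requested explicitly:

```python
import numpy as np

# Inference for Python strings is unchanged: with no explicit dtype,
# a list of str still produces a fixed-width unicode ('U') array.
arr = np.array(["hello", "world"])
print(arr.dtype.kind)  # 'U'

# Opting into the proposed DType is explicit, e.g.
# np.array(data, dtype=StringDType()) as in the prototype shown above.
```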
On Tue, Aug 29, 2023 at 4:08 PM Nathan <nathan.goldbaum@gmail.com> wrote:
The NEP was merged in draft form, see below.
This is a really nice NEP, thanks Nathan! I see that questions and constructive feedback are still coming in on GitHub, but for now it seems like everyone is pretty happy with moving forward with implementing this new dtype in NumPy.

Cheers,
Ralf
On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hello all,
I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483.
Per NEP 0, I've copied everything up to the "detailed description" section below.
I'm looking forward to your feedback on this.
-Nathan Goldbaum
========================================================= NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy =========================================================
:Author: Nathan Goldbaum <ngoldbaum@quansight.com> :Status: Draft :Type: Standards Track :Created: 2023-06-29
Abstract --------
We propose adding a new string data type to NumPy where each item in the array is an arbitrary length UTF-8 encoded string. This will enable performance, memory usage, and usability improvements for NumPy users, including:
* Memory savings for workflows that currently use fixed-width strings and store primarily ASCII data or a mix of short and long strings in a single NumPy array.
* Downstream libraries and users will be able to move away from object arrays currently used as a substitute for variable-length string arrays, unlocking performance improvements by avoiding passes over the data outside of NumPy.
* A more intuitive user-facing API for working with arrays of Python strings, without a need to think about the in-memory array representation.
Motivation and Scope --------------------
First, we will describe how the current state of support for string or string-like data in NumPy arose. Next, we will summarize the last major previous discussion about this topic. Finally, we will describe the scope of the proposed changes to NumPy as well as changes that are explicitly out of scope of this proposal.
History of String Support in Numpy **********************************
Support in NumPy for textual data evolved organically in response to early user needs and then changes in the Python ecosystem.
Support for strings was added to numpy to support users of the NumArray ``chararray`` type. Remnants of this are still visible in the NumPy API: string-related functionality lives in ``np.char``, to support the obsolete ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string DTypes.
NumPy's ``bytes_`` DType was originally used to represent the Python 2 `` str`` type before Python 3 support was added to NumPy. The bytes DType makes the most sense when it is used to represent Python 2 strings or other null-terminated byte sequences. However, ignoring data after the first null character means the ``bytes_`` DType is only suitable for bytestreams that do not contain nulls, so it is a poor match for generic bytestreams.
The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It stores data in 32-bit UCS-4 codepoints (e.g. a UTF-32 encoding), which makes for a straightforward implementation, but is inefficient for storing text that can be represented well using a one-byte ASCII or Latin-1 encoding. This was not a problem in Python 2, where ASCII or mostly-ASCII text could use the Python 2 ``str`` DType (the current ``bytes_`` DType).
With the arrival of Python 3 support in NumPy, the string DTypes were largely left alone due to backward compatibility concerns, although the unicode DType became the default DType for ``str`` data and the old ``string`` DType was renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal situation of shipping a data type originally intended for null-terminated bytestrings as the data type for *all* python ``bytes`` data, and a default string type with an in-memory representation that consumes four times as much memory as needed for ASCII or mostly-ASCII data.
Problems with Fixed-Width Strings
*********************************
Both existing string DTypes represent fixed-width sequences, allowing storage of the string data in the array buffer. This avoids adding out-of-band storage to NumPy; however, it makes for an awkward user interface. In particular, the maximum string size must be inferred by NumPy or estimated by the user before loading the data into a NumPy array or selecting an output DType for string operations. In the worst case, this requires an expensive pass over the full dataset to calculate the maximum length of an array element. It also wastes memory when array elements have varying lengths. Pathological cases where an array stores many short strings and a few very long strings are particularly bad for wasting memory.
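For example, a single long element dictates the storage reserved for every element in the array:

```python
import numpy as np

# One long element forces every element to reserve the maximum width.
arr = np.array(["a", "b" * 1000])
assert arr.dtype.itemsize == 4000   # 1000 UCS-4 code points per element
assert arr.nbytes == 8000           # the 1-character string wastes ~4 KB
```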
Downstream usage of string data in NumPy arrays has demonstrated the need for a variable-width string data type. In practice, most downstream users employ ``object`` arrays for this purpose. In particular, ``pandas`` has explicitly deprecated support for NumPy fixed-width strings, coerces NumPy fixed-width string arrays to ``object`` arrays, and in the future may switch to only supporting string data via ``PyArrow``, which has native support for UTF-8 encoded variable-width string arrays [1]_. This is unfortunate: ``object`` arrays have no type guarantees, necessitating expensive sanitization passes, and operations on object arrays cannot release the GIL.
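A small illustration of the missing type guarantees: an ``object`` array happily accepts non-string elements, so any consumer that needs strings must validate every element itself.

```python
import numpy as np

# Nothing prevents non-string objects from appearing in an "object"
# array used to hold strings, so consumers must check every element.
arr = np.array(["a", 1, None], dtype=object)
assert arr.dtype == object
assert not all(isinstance(x, str) for x in arr)
```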
Previous Discussions
--------------------
The project last discussed this topic in depth in 2017, when Julian Taylor proposed a fixed-width text data type parameterized by an encoding [2]_. This started a wide-ranging discussion about pain points for working with string data in NumPy and possible ways forward.
In the end, the discussion identified two use-cases that the current support for strings does a poor job of handling:
* Loading or memory-mapping scientific datasets with unknown encoding,
* Working with string data in a manner that allows transparent conversion between NumPy arrays and Python strings, including support for missing strings.
As a result of this discussion, improving support for string data was added to the NumPy project roadmap [3]_, with an explicit call-out to add a DType better suited to memory-mapping bytes with any or no encoding, and a variable-width string DType that supports missing data to replace usages of object string arrays.
Proposed work
-------------
This NEP proposes adding ``StringDType``, a DType that stores variable-width heap-allocated strings in NumPy arrays, to replace downstream usages of the ``object`` DType for string data. This work will heavily leverage recent improvements in NumPy to improve support for user-defined DTypes, so we will also necessarily be working on the data type internals in NumPy. In particular, we propose to:
* Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
* Work out issues related to adding a DType implemented using the experimental DType API to NumPy itself.
* Add support for a user-provided missing data sentinel.
* Clean up ``np.char``, moving the ufunc-like functions to a new namespace for functions and types related to string support.
* Update the ``npy`` and ``npz`` file formats to allow storage of arbitrary-length sidecar data.
The following is out of scope for this work:
* Changing DType inference for string data.
* Adding a DType for memory-mapping text in unknown encodings or a DType that attempts to fix issues with the ``bytes_`` DType.
* Fully agreeing on the semantics of missing data sentinels or adding a missing data sentinel to NumPy itself.
* Implementing fast ufuncs or SIMD optimizations for string operations.
While we're explicitly ruling out implementing these items as part of this work, adding a new string DType helps set up future work that does implement some of these items.
If implemented, this NEP will make it easier to add a new fixed-width text DType in the future by moving string operations into a long-term supported namespace. We are also proposing a memory layout that should be amenable to writing fast ufuncs and SIMD optimization in some cases, increasing the payoff for writing string operations as SIMD-optimized ufuncs in the future.
While we are not proposing adding a missing data sentinel to NumPy, we are proposing adding support for an optional, user-provided missing data sentinel, so this does move NumPy a little closer to officially supporting missing data. We are attempting to avoid resolving the disagreement described in :ref:`NEP 26<NEP26>` and this proposal does not require or preclude adding a missing data sentinel or bitflag-based missing data support in the future.
Usage and Impact
----------------
The DType is intended as a drop-in replacement for object string arrays. This means that we intend to support as many downstream usages of object string arrays as possible, including all supported NumPy functionality. Pandas is the obvious first user, and substantial work has already occurred to add support in a fork of Pandas. ``scikit-learn`` also uses object string arrays and will be able to migrate to a DType with guarantees that the arrays contain only strings. Both h5py [4]_ and PyTables [5]_ will be able to add first-class support for variable-width UTF-8 encoded string datasets in HDF5. String data are heavily used in machine-learning workflows, and downstream machine learning libraries will be able to leverage this new DType.
Users who wish to load string data into NumPy and leverage NumPy features like advanced indexing will have a natural choice that offers substantial memory savings over fixed-width unicode strings and better validation guarantees and overall integration with NumPy than object string arrays. Moving to a first-class string DType also removes the need to acquire the GIL during string operations, unlocking future optimizations that are impossible with object string arrays.
Performance
***********
Here we briefly describe preliminary performance measurements of the prototype version of ``StringDType`` we have implemented outside of NumPy using the experimental DType API. All benchmarks in this section were performed on a Dell XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy, Pandas, and the ``StringDType`` prototype were all compiled with meson release builds.
Currently, the ``StringDType`` prototype has comparable performance with object arrays and fixed-width string arrays. One exception is array creation from Python strings, where performance is somewhat slower than object arrays and comparable to fixed-width unicode arrays::
    In [1]: from stringdtype import StringDType

    In [2]: import numpy as np

    In [3]: data = [str(i) * 10 for i in range(100_000)]

    In [4]: %timeit arr_object = np.array(data, dtype=object)
    3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
    12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
    11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In this example, the object DType is substantially faster because the objects in the ``data`` list can be directly referenced in the array, while the fixed-width unicode DType and ``StringDType`` must copy the string data, and ``StringDType`` additionally needs to convert the data to UTF-8 and perform heap allocations outside the array buffer. In the future, if Python moves to a UTF-8 internal representation for strings, the string loading performance of ``StringDType`` should improve.
String operations have similar performance::
    In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
    30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [8]: %timeit np.char.capitalize(arr_stringdtype)
    38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [9]: %timeit np.char.capitalize(arr_strdtype)
    46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The poor performance here is a reflection of the slow iterator-based implementation of operations in ``np.char``. If we were to rewrite these operations as ufuncs, we could unlock substantial performance improvements. Using the example of the ``add`` ufunc, which we have implemented for the ``StringDType`` prototype::
    In [10]: %timeit arr_object + arr_object
    10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [11]: %timeit arr_stringdtype + arr_stringdtype
    5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
    65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As described below, we have already updated a fork of Pandas to use a prototype version of ``StringDType``. This demonstrates the performance improvements available when data are already loaded into a NumPy array and are passed to a third-party library. Currently Pandas attempts to coerce all ``str`` data to ``object`` DType by default, and has to check and sanitize existing ``object`` arrays that are passed in. This requires a copy or pass over the data that is made unnecessary by first-class support for variable-width strings in both NumPy and Pandas::
    In [13]: import pandas as pd

    In [14]: %timeit pd.Series(arr_stringdtype)
    20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [15]: %timeit pd.Series(arr_object)
    1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
We have also implemented a Pandas extension DType that uses ``StringDType`` under the hood, which is also substantially faster for creating Pandas data structures than the existing Pandas string DType that uses ``object`` arrays::
    In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
    54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
    1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Backward compatibility
----------------------
We are not proposing a change to DType inference for Python strings and do not expect to see any impacts on existing usages of NumPy, besides warnings or errors related to new deprecations or expiring deprecations in ``np.char``.
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: ralf.gommers@googlemail.com
On Wed, Aug 30, 2023 at 4:25 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Tue, Aug 29, 2023 at 4:08 PM Nathan <nathan.goldbaum@gmail.com> wrote:
The NEP was merged in draft form, see below.
This is a really nice NEP, thanks Nathan! I see that questions and constructive feedback are still coming in on GitHub, but for now it seems like everyone is pretty happy with moving forward with implementing this new dtype in NumPy.
Cheers, Ralf
To echo Ralf's comments, thank you for this very well-written proposal! I particularly appreciate the detailed consideration of how to handle different models of missing values. Overall, I am very excited about this work. A UTF-8 dtype in NumPy is long overdue, and will bring significant benefits to the entire scientific Python ecosystem.
On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com> wrote:
The NEP was merged in draft form, see below.
https://numpy.org/neps/nep-0055-string_dtype.html
This will be a nice addition to NumPy, and matches a suggestion by @rkern (and probably others) made in the 2017 mailing list thread; see the last bullet of https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html

So +1 for the enhancement! Now for some nitty-gritty review...

There is a design change that I think should be made in the implementation of missing values.

In the current design described in the NEP, and expanded on in the comment https://github.com/numpy/numpy/pull/24483#discussion_r1311815944, the meaning of the values `{len = 0, buf = NULL}` in an instance of `npy_static_string` depends on whether or not the `na_object` has been set in the dtype. If it has not been set, that data represents a string of length 0. If `na_object` *has* been set, that data represents a missing value. To get a string of length 0 in this case, some non-NULL value must be assigned to the `buf` field. (In the comment linked above, @ngoldbaum suggested `{0, "\0"}`, but strings are not NUL-terminated, so there is no need for that `\0` in `buf`; in fact, with `len == 0`, it would be a bug for the pointer to be dereferenced, so *any* non-NULL value--valid pointer or not--could be used for `buf`.)

I think it would be better if `len == 0` *always* meant a string of length 0, with no additional qualifications; it shouldn't be necessary to put some non-NULL value in `buf` just to get an empty string. We can achieve this if we use a bit in `len` as a flag for a missing value. Reserving a bit from `len` as a flag reduces the maximum possible string length, but as discussed in the NEP pull request, we're almost certainly going to reserve at least the high bit of `len` when small string optimization (SSO) is implemented. This will reduce the maximum string length to `2**(N-1)-1`, where `N` is the bit width of `size_t` (equivalent to using a signed type for `len`). Even if SSO isn't implemented immediately, we can anticipate the need for flags stored in `len`, and use them to implement missing values.

The actual implementation of SSO will require some more design work, because the offset of the most significant byte of `len` within the `npy_static_string` struct depends on the platform endianness. For little-endian, the most significant byte is not the first byte in the struct, so the bytes available for SSO within the struct are not contiguous when the fields have the order `{len, buf}`.

I experimented with these ideas, and put the result at https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring

The idea that I propose there is to make the memory layout of the struct depend on the endianness of the platform, so that the most significant byte of `len` (which I called `size`, to avoid any chance of confusion with the actual length of the string [1]) is at the beginning of the struct on big-endian platforms and at the end of the struct on little-endian platforms. More details are included in the file README.md. Note that I am not suggesting that all the SSO stuff be included in the current NEP! This is just a proof-of-concept that shows one possibility for SSO.

In that design, the high bit of `size` (which is `len` here) being set indicates that the `npy_static_string` struct should not be interpreted as the standard `{len, buf}` representation of a string. When the second highest bit is set, it means we have a missing value. If the second highest bit is not set, SSO is active; see the link above for more details.

With this design, `len == 0` *always* means a string of length 0, regardless of whether or not `na_object` is defined in the dtype. Also with this design, an array created with `calloc()` will automatically be an array of empty strings. With the current design in the NEP, an array created with `calloc()` will be either an array of empty strings or an array of missing values, depending on whether or not the dtype has `na_object` defined. That conditional behavior seems less than desirable.

What do you think?

--Warren

[1] I would like to see `len` renamed to `size` in the `npy_static_string` struct, but that's bikeshed stuff, and not a blocker.
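The flag-bit scheme described above can be sketched in a few lines of Python. The names and exact bit layout here are illustrative assumptions based on the description, not the actual `npy_static_string` implementation:

```python
# Sketch of the proposed flag bits in a 64-bit `size` field.
# Names and layout are assumptions for demonstration only.
N = 64                     # bit width of size_t on a 64-bit platform
HIGH = 1 << (N - 1)        # set: not a plain {len, buf} string
SECOND = 1 << (N - 2)      # with HIGH set: missing value; clear: SSO
LEN_MASK = HIGH - 1        # remaining bits hold the length

MISSING = HIGH | SECOND    # bit pattern marking a missing entry

def is_missing(size_field):
    return bool(size_field & HIGH) and bool(size_field & SECOND)

def is_sso(size_field):
    return bool(size_field & HIGH) and not (size_field & SECOND)

# len == 0 now unambiguously means the empty string, so zero-filled
# (calloc'd) memory is an array of empty strings, never missing values.
assert not is_missing(0)
assert (0 & LEN_MASK) == 0
assert is_missing(MISSING)
assert is_sso(HIGH | 5)    # an SSO-encoded string of length 5
```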
On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckesser@gmail.com> wrote:
Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance this seems like a really nice improvement. I'm going to try to integrate your proposed design into the dtype prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance this seems like a really nice improvement.
I'm going to try to integrate your proposed design into the dtype prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
Sure, that would be fine. I have a few more comments and questions about the NEP that I'll finish up and send this weekend. Warren
On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckesser@gmail.com> wrote:
One more comment on the NEP...

My first impression of the missing data API design is that it is more complicated than necessary. An alternative that is simpler--and is consistent with the pattern established for floats and datetimes--is to define a "not a string" value, say `np.nastring` or something similar, just like we have `nan` for floats and `nat` for datetimes. Its behavior could be what you called "nan-like". The handling of `np.nastring` would be an intrinsic part of the dtype, so there would be no need for the `na_object` parameter of `StringDType`. All `StringDType`s would handle `np.nastring` in the same consistent manner.

The use-case for the string sentinel does not seem very compelling (but maybe I just don't understand the use-cases). If there is a real need here that is not covered by `np.nastring`, perhaps just a flag to control the repr of `np.nastring` for each `StringDType` instance would be enough?

If there is an objection to a potential proliferation of "not a thing" special values, one for each type that can handle them, then perhaps a generic "not a value" (say `np.navalue`) could be created that, when assigned to an element of an array, results in the appropriate "not a thing" value actually being assigned. In a sense, I guess this NEP is proposing that, but it is reusing the floating point object `np.nan` as the generic "not a thing" value, and my preference is that, *if* we go with such a generic object, it is not the floating point value `nan` but a new thing with a name that reflects its purpose. (I guess Pandas users might be accustomed to `nan` being a generic sentinel for missing data, so its use doesn't feel as incohesive as it might to others. Passing a string array to `np.isnan()` just feels *wrong* to me.)

Anyway, that's my 2¢.

Warren
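For reference, the existing per-dtype sentinels this comment appeals to behave like this (`np.nastring` itself does not exist; the name is only a suggestion above):

```python
import numpy as np

# floats and datetimes each have an intrinsic "not a thing" sentinel
assert np.isnan(np.float64("nan"))       # nan for floats
assert np.isnat(np.datetime64("NaT"))    # nat for datetimes

# A hypothetical np.nastring would play the same role for StringDType,
# without any per-dtype na_object configuration.
```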
There is a design change that I think should be made in the implementation of missing values.
In the current design described in the NEP, and expanded on in the comment
https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
the meaning of the values `{len = 0, buf = NULL}` in an instance of `npy_static_string` depends on whether or not the `na_object` has been set in the dtype. If it has not been set, that data represents a string of length 0. If `na_object` *has* been set, that data represents a missing value. To get a string of length 0 in this case, some non-NULL value must be assigned to the `buf` field. (In the comment linked above, @ngoldbaum suggested `{0, "\0"}`, but strings are not NUL-terminated, so there is no need for that `\0` in `buf`, and in fact, with `len == 0`, it would be a bug for the pointer to be dereferenced, so *any* non-NULL value--valid pointer or not--could be used for `buf`.)
I think it would be better if `len == 0` *always* meant a string with length 0, with no additional qualifications; it shouldn't be necessary to put some non-NULL value in `buf` just to get an empty string. We can achieve this if we use a bit in `len` as a flag for a missing value. Reserving a bit from `len` as a flag reduces the maximum possible string length, but as discussed in the NEP pull request, we're almost certainly going to reserve at least the high bit of `len` when small string optimization (SSO) is implemented. This will reduce the maximum string length to `2**(N-1)-1`, where `N` is the bit width of `size_t` (equivalent to using a signed type for `len`). Even if SSO isn't implemented immediately, we can anticipate the need for flags stored in `len`, and use them to implement missing values.
The actual implementation of SSO will require some more design work, because the offset of the most significant byte of `len` within the `npy_static_string` struct depends on the platform endianness. On little-endian platforms, the most significant byte is not the first byte in the struct, so the bytes available for SSO within the struct are not contiguous when the fields have the order `{len, buf}`.
I experimented with these ideas, and put the result at
https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring
The idea that I propose there is to make the memory layout of the struct depend on the endianness of the platform, so the most significant byte of `len` (which I called `size`, to avoid any chance of confusion with the actual length of the string [1]) is at the beginning of the struct on big-endian platforms and at the end of the struct on little-endian platforms. More details are included in the file README.md. Note that I am not suggesting that all the SSO stuff be included in the current NEP! This is just a proof-of-concept that shows one possibility for SSO.
In that design, the high bit of `size` (which is `len` here) being set indicates that the `npy_static_string` struct should not be interpreted as the standard `{len, buf}` representation of a string. When the second highest bit is set, it means we have a missing value. If the second highest bit is not set, SSO is active; see the link above for more details.
With this design, `len == 0` *always* means a string of length 0, regardless of whether or not `na_object` is defined in the dtype.
Also with this design, an array created with `calloc()` will automatically be an array of empty strings. With the current design in the NEP, an array created with `calloc()` will be either an array of empty strings or an array of missing values, depending on whether or not the dtype has `na_object` defined. That conditional behavior seems less than desirable.
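The branch logic of that scheme can be sketched in a few lines of Python (the names here are illustrative, not the ones used in the linked proof-of-concept):

```python
# Illustrative model of the two-flag scheme described above, using the
# top two bits of a 64-bit "size" word:
#   high bit clear                 -> ordinary {len, buf} string
#   high bit set, next bit set     -> missing value
#   high bit set, next bit clear   -> small string optimization (SSO)

N = 64
NOT_STANDARD = 1 << (N - 1)   # high bit: not a plain {len, buf} string
MISSING = 1 << (N - 2)        # second-highest bit: missing value

def classify(size_word):
    if not (size_word & NOT_STANDARD):
        return "regular"      # includes size_word == 0: an empty string
    if size_word & MISSING:
        return "missing"
    return "sso"

# calloc() zero-fills memory, so a freshly allocated array decodes as
# empty strings whether or not the dtype defines an na_object:
assert classify(0) == "regular"
assert classify(NOT_STANDARD | MISSING) == "missing"
assert classify(NOT_STANDARD | 7) == "sso"   # 7 = illustrative small-string length
```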
What do you think?
--Warren
[1] I would like to see `len` renamed to `size` in the `npy_static_string` struct, but that's bikeshed stuff, and not a blocker.
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: nathan12343@gmail.com
On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <warren.weckesser@gmail.com> wrote:
On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com> wrote:
> The NEP was merged in draft form, see https://numpy.org/neps/nep-0055-string_dtype.html
This will be a nice addition to NumPy, and matches a suggestion by @rkern (and probably others) made in the 2017 mailing list thread; see the last bullet of
https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
So +1 for the enhancement!
Now for some nitty-gritty review...
Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance this seems like a really nice improvement. I'm going to try to integrate your proposed design into the dtype prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
Sure, that would be fine.
I have a few more comments and questions about the NEP that I'll finish up and send this weekend.
> One more comment on the NEP...
>
> My first impression of the missing data API design is that it is more complicated than necessary. An alternative that is simpler--and is consistent with the pattern established for floats and datetimes--is to define a "not a string" value, say `np.nastring` or something similar, just as we have `nan` for floats and `nat` for datetimes. Its behavior could be what you called "nan-like".

Float `np.nan` and the datetime missing value sentinel are not all that similar, and the latter was always a bit questionable (at least partially it's a left-over of trying to introduce generic missing value support, I believe). `nan` is a float and part of the C/C++ standards, with well-defined numerical behavior. In contrast, there is no `np.nat`; you can retrieve a sentinel value only with `np.datetime64("NaT")`. I'm not sure if it's possible to generate a NaT value with a regular operation on a datetime array a la `np.array([1.5]) / 0.0`.

> The handling of `np.nastring` would be an intrinsic part of the dtype, so there would be no need for the `na_object` parameter of `StringDType`. All `StringDType`s would handle `np.nastring` in the same consistent manner.
>
> The use-case for the string sentinel does not seem very compelling (but maybe I just don't understand the use-cases). If there is a real need here that is not covered by `np.nastring`, perhaps just a flag to control the repr of `np.nastring` for each `StringDType` instance would be enough?

My understanding is that the NEP provides the necessary but limited support to allow Pandas to adopt the new dtype. The scope section of the NEP says: "Fully agreeing on the semantics of a missing data sentinels or adding a missing data sentinel to NumPy itself." And then further down: "By only supporting user-provided missing data sentinels, we avoid resolving exactly how NumPy itself should support missing data and the correct semantics of the missing data object, leaving that up to users to decide." That general approach I agree with; it's a large can of worms and not the main purpose of this NEP. Nathan may have more thoughts about what, if anything, from your suggestions could be adopted, but the general "let's introduce a missing value thing" is a path we should not go down here, imho.

> If there is an objection to a potential proliferation of "not a thing" special values, one for each type that can handle them, then perhaps a generic "not a value" (say `np.navalue`) could be created that, when assigned to an element of an array, results in the appropriate "not a thing" value actually being assigned. In a sense, I guess this NEP is proposing that, but it is reusing the floating-point object `np.nan` as the generic "not a thing" value

It is explicitly not using `np.nan` but instead allowing the user to provide their preferred sentinel. You're probably referring to the example with `na_object=np.nan`, but that example would work with another sentinel value too.

Cheers,
Ralf

> , and my preference is that, *if* we go with such a generic object, it is not the floating-point value `nan` but a new thing with a name that reflects its purpose. (I guess Pandas users might be accustomed to `nan` being a generic sentinel for missing data, so its use doesn't feel as incohesive as it might to others. Passing a string array to `np.isnan()` just feels *wrong* to me.)
>
> Anyway, that's my 2¢.
>
> Warren
On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:
I was a bit surprised that `len` was not used as part of the missing value. The NEP proposal that 0 is an empty string unless there is a sentinel, in which case it is a missing value, feels pretty limiting, since these are distinctly different things.

Would it make sense for `len < 0` to indicate a missing value? This would require using `ssize_t` instead of `size_t`, and would then limit the string size. In principle the negative range would allow for many distinct missing values. I think `ssize_t` is well-defined on all platforms targeted by NumPy.

Kevin
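A sketch of the signed-length alternative in Python (illustrative only; a real implementation would use `ssize_t` in C): reinterpreting the 64-bit length word as signed makes every negative value available as a missing-data marker, while `0` stays an ordinary empty string and the maximum real length drops to `2**63 - 1`.

```python
# Illustrative model of a signed (ssize_t-like) length field where any
# negative value marks missing data. Hypothetical names; not NumPy code.

WORD_BITS = 64
SIGNED_MAX = 2**(WORD_BITS - 1) - 1  # max real string length under this scheme

def as_signed(word):
    """Reinterpret an unsigned 64-bit length word as a signed value."""
    word &= (1 << WORD_BITS) - 1
    return word - (1 << WORD_BITS) if word >= (1 << (WORD_BITS - 1)) else word

def is_missing(word):
    """Under the signed-length scheme, negative length means missing data."""
    return as_signed(word) < 0

assert not is_missing(0)           # zero is still an ordinary empty string
assert is_missing(2**64 - 1)       # all-ones word reinterprets as -1: missing
assert not is_missing(SIGNED_MAX)  # the largest representable real length
```

Distinct negative codes (-1, -2, ...) could in principle distinguish different kinds of missing values, at the cost of halving the maximum string length.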
On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard <kevin.k.sheppard@gmail.com> wrote:
Hey Kevin,

Thanks for the comment. Right now the NEP text is a little out of date compared to the implementation. I've since rewritten it to use Warren's proposal more or less verbatim, so now the missing-value flag is stored in a bit of the size field.

See https://github.com/numpy/numpy-user-dtypes/pull/86 for the implementation, which also includes a small string optimization implementation.
Hi,

I know that I'm a little late to be asking about this, but I don't see a comment elsewhere on it (in the NEP, the implementation PR #25347, or this email thread).

As I understand it, the new StringDType implementation distinguishes 3 types of individual strings, any of which can be present in an array:

1. short strings, included inline in the array (at most 15 bytes on a 64-bit system)
2. arena-allocated strings, which are managed by the npy_string_allocator
3. heap-allocated strings, which are pointers anywhere in RAM.

Does case 3 include strings that are passed to the array as views, without copying? If so, then the ownership of strings would either need to be tracked on a per-string basis (distinct from the array_owned boolean, which characterizes the whole array), or they need to all be considered stolen references (NumPy will free all of them when the array goes out of scope), or they all need to be considered borrowed references (NumPy will not free any of them when the array goes out of scope).

If the array does not accept new strings as views, but always copies any externally provided string, then why distinguish between cases 2 and 3? How would an array end up with some strings being arena-allocated and other strings being heap-allocated?

Thanks!

-- Jim
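For readers following along, the three cases can be sketched as a small Python classifier. The 15-byte threshold comes from the description above; the arena-vs-heap split depends on how a string was allocated rather than on its contents, so it is passed in as an explicit flag here. This is an illustration of the taxonomy, not the actual NumPy logic.

```python
# Illustrative taxonomy of the three storage classes described above.
# Names and the classifier itself are hypothetical, not NumPy code.
from enum import Enum

class Storage(Enum):
    SHORT = "inline in the array (SSO), at most 15 bytes on 64-bit"
    ARENA = "arena-allocated, managed by npy_string_allocator"
    HEAP = "heap-allocated, pointer anywhere in RAM"

SSO_CAPACITY = 15  # inline capacity on a 64-bit system, per the description

def storage_class(payload: bytes, on_heap: bool = False) -> Storage:
    """Classify a string payload; on_heap models the allocation history."""
    if len(payload) <= SSO_CAPACITY:
        return Storage.SHORT
    return Storage.HEAP if on_heap else Storage.ARENA

assert storage_class(b"short") is Storage.SHORT
assert storage_class(b"x" * 100) is Storage.ARENA
assert storage_class(b"x" * 100, on_heap=True) is Storage.HEAP
```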
On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard <kevin.k.sheppard@gmail.com> wrote:
On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldbaum@gmail.com>
On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <
warren.weckesser@gmail.com> wrote:
> > > > On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com> wrote: > > > > The NEP was merged in draft form, see below. > > > > https://numpy.org/neps/nep-0055-string_dtype.html > > > > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote: > >> > >> Hello all, > >> > >> I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483. > >> > >> Per NEP 0, I've copied everything up to the "detailed description" section below. > >> > >> I'm looking forward to your feedback on this. > >> > >> -Nathan Goldbaum > >> > > This will be a nice addition to NumPy, and matches a suggestion by > @rkern (and probably others) made in the 2017 mailing list thread; > see the last bullet of > > https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html > > So +1 for the enhancement! > > Now for some nitty-gritty review...
Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance
I'm going to try to integrate your proposed design into the dtype
wrote: this seems like a really nice improvement. prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
Sure, that would be fine.
I have a few more comments and questions about the NEP that I'll
finish up and send this weekend.
One more comment on the NEP...
My first impression of the missing data API design is that it is more complicated than necessary. An alternative that is simpler--and is consistent with the pattern established for floats and datetimes--is to define a "not a string" value, say `np.nastring` or something similar, just like we have `nan` for floats and `nat` for datetimes. Its behavior could be what you called "nan-like".
Float `np.nan` and datetime missing value sentinel are not all that similar, and the latter was always a bit questionable (at least partially it's a left-over of trying to introduce generic missing value support I believe). `nan` is a float and part of C/C++ standards with well-defined numerical behavior. In contrast, there is no `np.nat`; you can retrieve a sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's possible to generate a NaT value with a regular operation on a datetime array a la `np.array([1.5]) / 0.0`.
The handling of `np.nastring` would be an intrinsic part of the
dtype, so there would be no need for the `na_object` parameter of `StringDType`. All `StringDType`s would handle `np.nastring` in the same consistent manner.
The use-case for the string sentinel does not seem very compelling (but maybe I just don't understand the use-cases). If there is a real need here that is not covered by `np.nastring`, perhaps just a flag to control the repr of `np.nastring` for each StringDType instance would be enough?
My understanding is that the NEP provides the necessary but limited support to allow Pandas to adopt the new dtype. The scope section of the NEP says: "Fully agreeing on the semantics of a missing data sentinels or adding a missing data sentinel to NumPy itself.". And then further down: "By only supporting user-provided missing data sentinels, we avoid resolving exactly how NumPy itself should support missing data and the correct semantics of the missing data object, leaving that up to users to decide"
That general approach I agree with, it's a large can of worms and not the main purpose of this NEP. Nathan may have more thoughts about what, if anything, from your suggestions could be adopted, but the general "let's introduce a missing value thing" is a path we should not go down here imho.
If there is an objection to a potential proliferation of "not a thing" special values, one for each type that can handle them, then perhaps a generic "not a value" (say `np.navalue`) could be created that, when assigned to an element of an array, results in the appropriate "not a thing" value actually being assigned. In a sense, I guess this NEP is proposing that, but it is reusing the floating point object `np.nan` as the generic "not a thing" value
It is explicitly not using `np.nan` but instead allowing the user to provide their preferred sentinel. You're probably referring to the example with `na_object=np.nan`, but that example would work with another sentinel value too.
Cheers, Ralf
, and my preference is that, *if* we go with such a generic object, it is not the floating point value `nan` but a new thing with a name that reflects its purpose. (I guess Pandas users might be accustomed to `nan` being a generic sentinel for missing data, so its use doesn't feel as incohesive as it might to others. Passing a string array to `np.isnan()` just feels *wrong* to me.)
Any, that's my 2¢.
Warren
I was a bit surprised that len was not used as part of the missing value. The NEP proposal that 0 is a empty string unless there is a sentinal in which case it is a missing value feels pretty limiting, since these are distinctly different things.
Would it make sense for len<0 to indicate a missing value. This would require using ssize_t instead of size_t, and would then limit the string size. In principle this would allow for sizeof(ssize_t) / 2 distinct missing value. I think ssize_t is well-defined on all platforms targeted by NumPy.
Kevin
Hey Kevin,
Thanks for the comment. Right now the current NEP text is a little out of date compared to the implementation. I've since rewritten it to use Warren's proposal more or less verbatim, so now the missing value flag is stored in a bit of the size field
See https://github.com/numpy/numpy-user-dtypes/pull/86 for the implementation, which also includes a small string optimization implementation.
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-leave@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: nathan12343@gmail.com
On Mon, Feb 12, 2024 at 1:47 PM Jim Pivarski <jpivarski@gmail.com> wrote:
Hi,
I know that I'm a little late to be asking about this, but I don't see a comment elsewhere on it (in the NEP, the implementation PR #25347, or this email thread).
As I understand it, the new StringDType implementation distinguishes 3 types of individual strings, any of which can be present in an array:
1. short strings, included inline in the array (at most 15 bytes on a 64-bit system)
2. arena-allocated strings, which are managed by the npy_string_allocator
3. heap-allocated strings, which are pointers anywhere in RAM
Does case 3 include strings that are passed to the array as views, without copying? If so, then the ownership of strings would either need to be tracked on a per-string basis (distinct from the array_owned boolean, which characterizes the whole array), or they need to all be considered stolen references (NumPy will free all of them when the array goes out of scope), or they all need to be considered borrowed references (NumPy will not free any of them when the array goes out of scope).
StringDType arrays don't intern Python strings directly; there's always a copy. Array views are allowed, but I don't think that's what you're talking about. The mutex guarding access to the string data prevents arrays from being garbage collected while a C thread holds a pointer to the string data, at least assuming correct usage of the C API that doesn't try to use a string after releasing the allocator.
If the array does not accept new strings as views, but always copies any externally provided string, then why distinguish between cases 2 and 3? How would an array end up with some strings being arena-allocated and other strings being heap-allocated?
You can only get a heap string entry in an array if you enlarge an entry in the array. The goal with allowing heap strings like this was to have an escape hatch that allows enlarging a single array entry without adding complexity or needing to re-allocate the entire arena buffer. For example, if you create an array with a short string entry and then edit that entry to be longer than 15 bytes. Rather than appending to the arena or re-allocating it, we convert the entry to a heap string.
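To make the storage classes and the "escape hatch" concrete, here is a minimal C sketch of copy-on-store with short-to-heap promotion on enlargement. All names (`vstring`, `vstring_from`, `vstring_set`) and the layout are illustrative, not the actual NumPy implementation:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only -- not the actual npy_static_string layout.
 * Short strings live inline in the array entry; anything longer is
 * copied to a separately allocated buffer (standing in for the arena
 * and heap cases, which this sketch does not distinguish). */
typedef enum { VSTRING_SHORT, VSTRING_HEAP } vstring_kind;

typedef struct {
    vstring_kind kind;
    size_t len;
    union {
        char inline_buf[15]; /* inline storage for short strings */
        char *ptr;           /* allocated buffer for long strings */
    } data;
} vstring;

/* Store a copy of s; there is never a borrowed or viewed buffer. */
static vstring vstring_from(const char *s)
{
    vstring v;
    v.len = strlen(s);
    if (v.len <= sizeof v.data.inline_buf) {
        v.kind = VSTRING_SHORT;
        memcpy(v.data.inline_buf, s, v.len);
    }
    else {
        v.kind = VSTRING_HEAP;
        v.data.ptr = malloc(v.len);
        memcpy(v.data.ptr, s, v.len);
    }
    return v;
}

/* Reassigning an entry past the inline capacity converts it to a heap
 * string -- the per-entry escape hatch described above -- without
 * touching any shared arena buffer. */
static void vstring_set(vstring *v, const char *s)
{
    if (v->kind == VSTRING_HEAP) {
        free(v->data.ptr);
    }
    *v = vstring_from(s);
}
```

The key design point this illustrates: because every store is a copy, ownership stays per-entry and trivial, and growing one entry never forces a reallocation of its neighbors.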
Thanks! -- Jim
On Wed, Sep 20, 2023 at 10:25 AM Nathan <nathan.goldbaum@gmail.com> wrote:
On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard < kevin.k.sheppard@gmail.com> wrote:
On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldbaum@gmail.com>
wrote:
Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance this seems like a really nice improvement. I'm going to try to integrate your proposed design into the dtype prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
Sure, that would be fine.
I have a few more comments and questions about the NEP that I'll finish up and send this weekend.
One more comment on the NEP...
My first impression of the missing data API design is that it is more complicated than necessary. An alternative that is simpler--and is consistent with the pattern established for floats and datetimes--is to define a "not a string" value, say `np.nastring` or something similar, just like we have `nan` for floats and `nat` for datetimes. Its behavior could be what you called "nan-like".
Float `np.nan` and the datetime missing value sentinel are not all that similar, and the latter was always a bit questionable (at least partially it's a left-over of trying to introduce generic missing value support, I believe). `nan` is a float and part of the C/C++ standards with well-defined numerical behavior. In contrast, there is no `np.nat`; you can retrieve a sentinel value with `np.datetime64("NaT")` only. I'm not sure it's possible to generate a NaT value with a regular operation on a datetime array a la `np.array([1.5]) / 0.0`.
The handling of `np.nastring` would be an intrinsic part of the dtype, so there would be no need for the `na_object` parameter of `StringDType`. All `StringDType`s would handle `np.nastring` in the same consistent manner.
The use-case for the string sentinel does not seem very compelling (but maybe I just don't understand the use-cases). If there is a real need here that is not covered by `np.nastring`, perhaps just a flag to control the repr of `np.nastring` for each StringDType instance would be enough?
My understanding is that the NEP provides the necessary but limited support to allow Pandas to adopt the new dtype. The scope section of the NEP says: "Fully agreeing on the semantics of a missing data sentinels or adding a missing data sentinel to NumPy itself.". And then further down: "By only supporting user-provided missing data sentinels, we avoid resolving exactly how NumPy itself should support missing data and the correct semantics of the missing data object, leaving that up to users to decide"
That general approach I agree with, it's a large can of worms and not the main purpose of this NEP. Nathan may have more thoughts about what, if anything, from your suggestions could be adopted, but the general "let's introduce a missing value thing" is a path we should not go down here imho.
If there is an objection to a potential proliferation of "not a thing" special values, one for each type that can handle them, then perhaps a generic "not a value" (say `np.navalue`) could be created that, when assigned to an element of an array, results in the appropriate "not a thing" value actually being assigned. In a sense, I guess this NEP is proposing that, but it is reusing the floating point object `np.nan` as the generic "not a thing" value
It is explicitly not using `np.nan` but instead allowing the user to provide their preferred sentinel. You're probably referring to the example with `na_object=np.nan`, but that example would work with another sentinel value too.
Cheers, Ralf
, and my preference is that, *if* we go with such a generic object, it is not the floating point value `nan` but a new thing with a name that reflects its purpose. (I guess Pandas users might be accustomed to `nan` being a generic sentinel for missing data, so its use doesn't feel as incohesive as it might to others. Passing a string array to `np.isnan()` just feels *wrong* to me.)
Anyway, that's my 2¢.
Warren
I was a bit surprised that len was not used as part of the missing value. The NEP's proposal that len 0 is an empty string unless there is a sentinel, in which case it is a missing value, feels pretty limiting, since these are distinctly different things.
Would it make sense for len < 0 to indicate a missing value? This would require using ssize_t instead of size_t, and would then limit the string size, but it would in principle allow many distinct missing values. I think ssize_t is well-defined on all platforms targeted by NumPy.
Kevin
Hey Kevin,
Thanks for the comment. The current NEP text is a little out of date compared to the implementation. I've since rewritten it to use Warren's proposal more or less verbatim, so now the missing value flag is stored in a bit of the size field.
See https://github.com/numpy/numpy-user-dtypes/pull/86 for the implementation, which also includes a small string optimization implementation.
I see: thank you for the explanations!
On Mon, Feb 12, 2024 at 3:04 PM Nathan <nathan.goldbaum@gmail.com> wrote:
StringDType arrays don't intern Python strings directly; there's always a copy.
I had been thinking of accepting a memoryview without copying, but if there's always a copy in any case, that answers my question about ownership. -- Jim
On Wed, Sep 20, 2023 at 12:26 AM Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldbaum@gmail.com> wrote:
On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckesser@gmail.com> wrote:
On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldbaum@gmail.com> wrote:
The NEP was merged in draft form, see below.
https://numpy.org/neps/nep-0055-string_dtype.html
On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hello all,
I just opened a pull request to add NEP 55, see
https://github.com/numpy/numpy/pull/24483.
Per NEP 0, I've copied everything up to the "detailed description" section below.
I'm looking forward to your feedback on this.
-Nathan Goldbaum
This will be a nice addition to NumPy, and matches a suggestion by @rkern (and probably others) made in the 2017 mailing list thread; see the last bullet of
https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
So +1 for the enhancement!
Now for some nitty-gritty review...
Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance this seems like a really nice improvement. I'm going to try to integrate your proposed design into the dtype prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
Sure, that would be fine.
I have a few more comments and questions about the NEP that I'll finish up and send this weekend.
One more comment on the NEP...
My first impression of the missing data API design is that it is more complicated than necessary. An alternative that is simpler--and is consistent with the pattern established for floats and datetimes--is to define a "not a string" value, say `np.nastring` or something similar, just like we have `nan` for floats and `nat` for datetimes. Its behavior could be what you called "nan-like".
The handling of `np.nastring` would be an intrinsic part of the dtype, so there would be no need for the `na_object` parameter of `StringDType`. All `StringDType`s would handle `np.nastring` in the same consistent manner.
The use-case for the string sentinel does not seem very compelling (but maybe I just don't understand the use-cases). If there is a real need here that is not covered by `np.nastring`, perhaps just a flag to control the repr of `np.nastring` for each StringDType instance would be enough?
If there is an objection to a potential proliferation of "not a thing" special values, one for each type that can handle them, then perhaps a generic "not a value" (say `np.navalue`) could be created that, when assigned to an element of an array, results in the appropriate "not a thing" value actually being assigned. In a sense, I guess this NEP is proposing that, but it is reusing the floating point object `np.nan` as the generic "not a thing" value, and my preference is that, *if* we go with such a generic object, it is not the floating point value `nan` but a new thing with a name that reflects its purpose. (I guess Pandas users might be accustomed to `nan` being a generic sentinel for missing data, so its use doesn't feel as incohesive as it might to others. Passing a string array to `np.isnan()` just feels *wrong* to me.)
Anyway, that's my 2¢.
Warren
In addition to Ralf's points, I don't think it's possible for NumPy to support all downstream usages of object string arrays without something like what's in the NEP. Some downstream libraries want their NA sentinel to not be comparable with strings (like `None`). Some people want the result of comparisons with the NA sentinel to return the NA sentinel (libraries that use np.nan, pandas.NA also works like this). Others want the sentinel to behave like a string and have a well-defined ordering (pandas does this internally to support sorting strings with missing data in a low-level C routine). I don't see how it's possible to simultaneously support all of this in a single sentinel object, unless that object can be created with some parameters, and then we're no simpler than what I'm proposing *and* we have to decide on sensible default behavior.
Warren
There is a design change that I think should be made in the implementation of missing values.
In the current design described in the NEP, and expanded on in the comment
https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
the meaning of the values `{len = 0, buf = NULL}` in an instance of `npy_static_string` depends on whether or not the `na_object` has been set in the dtype. If it has not been set, that data represents a string of length 0. If `na_object` *has* been set, that data represents a missing value. To get a string of length 0 in this case, some non-NULL value must be assigned to the `buf` field. (In the comment linked above, @ngoldbaum suggested `{0, "\0"}`, but strings are not NUL-terminated, so there is no need for that `\0` in `buf`, and in fact, with `len == 0`, it would be a bug for the pointer to be dereferenced, so *any* non-NULL value--valid pointer or not--could be used for `buf`.)
I think it would be better if `len == 0` *always* meant a string with length 0, with no additional qualifications; it shouldn't be necessary to put some non-NULL value in `buf` just to get an empty string. We can achieve this if we use a bit in `len` as a flag for a missing value. Reserving a bit from `len` as a flag reduces the maximum possible string length, but as discussed in the NEP pull request, we're almost certainly going to reserve at least the high bit of `len` when small string optimization (SSO) is implemented. This will reduce the maximum string length to `2**(N-1)-1`, where `N` is the bit width of `size_t` (equivalent to using a signed type for `len`). Even if SSO isn't implemented immediately, we can anticipate the need for flags stored in `len`, and use them to implement missing values.
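As a sketch, the flag-bit scheme described above might look like this in C. The struct and macro names are illustrative only, not the actual implementation:

```c
#include <stddef.h>

/* Illustrative sketch of reserving the high bit of the size field as a
 * missing-value flag; names do not match the real NumPy code. The
 * maximum length becomes 2**(N-1) - 1, where N is the bit width of
 * size_t, equivalent to using a signed type for the length. */
#define VSTRING_MISSING_FLAG ((size_t)1 << (sizeof(size_t) * 8 - 1))
#define VSTRING_MAX_LEN      (VSTRING_MISSING_FLAG - 1)

typedef struct {
    size_t len;  /* high bit: missing-value flag; remaining bits: length */
    char *buf;
} static_string_sketch;

static int is_missing(const static_string_sketch *s)
{
    return (s->len & VSTRING_MISSING_FLAG) != 0;
}

static size_t string_len(const static_string_sketch *s)
{
    return s->len & ~VSTRING_MISSING_FLAG;
}
```

With this encoding an all-zero struct is unambiguously a present, empty string, independent of whether the dtype defines an `na_object`.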
The actual implementation of SSO will require some more design work, because the offset of the most significant byte of `len` within the `npy_static_string` struct depends on the platform endianness. For little-endian platforms, the most significant byte is not the first byte in the struct, so the bytes available for SSO within the struct are not contiguous when the fields have the order `{len, buf}`.
I experimented with these ideas, and put the result at
https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring
The idea that I propose there is to make the memory layout of the struct depend on the endianness of the platform, so the most significant byte of `len` (which I called `size`, to avoid any chance of confusion with the actual length of the string [1]) is at the beginning of the struct on big-endian platforms and at the end of the struct for little-endian platforms. More details are included in the file README.md. Note that I am not suggesting that all the SSO stuff be included in the current NEP! This is just a proof-of-concept that shows one possibility for SSO.
In that design, the high bit of `size` (which is `len` here) being set indicates that the `npy_static_string` struct should not be interpreted as the standard `{len, buf}` representation of a string. When the second highest bit is set, it means we have a missing value. If the second highest bit is not set, SSO is active; see the link above for more details.
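A rough sketch of such an endianness-dependent layout (the preprocessor guard and the names are illustrative; see the linked repository for the actual proof of concept):

```c
#include <stddef.h>

/* Illustrative sketch: place the flag-carrying most significant byte of
 * `size` at the end of the struct on little-endian platforms and at the
 * start on big-endian ones, so that the remaining bytes of the struct
 * form one contiguous region usable for small string optimization. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
typedef struct {
    size_t size; /* MSB (the flag byte) is the first byte of the struct */
    char *buf;
} sso_string;
#else
typedef struct {
    char *buf;
    size_t size; /* MSB (the flag byte) is the last byte of the struct */
} sso_string;
#endif

/* Every byte except the single flag byte can hold inline string data. */
enum { SSO_CAPACITY = sizeof(sso_string) - 1 };
```

On a typical 64-bit platform this leaves 15 contiguous bytes for inline strings, matching the short-string capacity mentioned earlier in the thread.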
With this design, `len == 0` *always* means a string of length 0, regardless of whether or not `na_object` is defined in the dtype.
Also with this design, an array created with `calloc()` will automatically be an array of empty strings. With the current design in the NEP, an array created with `calloc()` will be either an array of empty strings or an array of missing values, depending on whether or not the dtype has `na_object` defined. That conditional behavior seems less than desirable.
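The `calloc()` point can be checked directly: under the flag-bit scheme, zeroed memory reads back as present, empty strings no matter how the dtype is parameterized. A small sketch with illustrative names:

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch: with a flag bit in the size field, all-zero
 * bytes mean "present, empty string", so calloc() yields a valid
 * array of empty strings whether or not na_object is set. */
#define MISSING_FLAG ((size_t)1 << (sizeof(size_t) * 8 - 1))

typedef struct {
    size_t len; /* high bit: missing flag; remaining bits: length */
    char *buf;
} sstring;

/* Returns 1 if every entry of a freshly calloc'ed array reads back as
 * a present (not missing), zero-length string. */
static int calloc_gives_empty_strings(size_t n)
{
    sstring *arr = calloc(n, sizeof *arr);
    if (arr == NULL) {
        return 0;
    }
    int ok = 1;
    for (size_t i = 0; i < n; i++) {
        ok = ok && (arr[i].len & ~MISSING_FLAG) == 0  /* length 0    */
                && (arr[i].len & MISSING_FLAG) == 0;  /* not missing */
    }
    free(arr);
    return ok;
}
```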
What do you think?
--Warren
[1] I would like to see `len` renamed to `size` in the `npy_static_string` struct, but that's bikeshed stuff, and not a blocker.
participants (6)
- Jim Pivarski
- Kevin Sheppard
- Nathan
- Ralf Gommers
- Stephan Hoyer
- Warren Weckesser