Hello all,

I just opened a pull request to add NEP 55, see
https://github.com/numpy/numpy/pull/24483. Per NEP 0, I've copied everything
up to the "detailed description" section below. I'm looking forward to your
feedback on this.

-Nathan Goldbaum

=========================================================
NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
=========================================================

:Author: Nathan Goldbaum <ngoldbaum@quansight.com>
:Status: Draft
:Type: Standards Track
:Created: 2023-06-29

Abstract
--------

We propose adding a new string data type to NumPy where each item in the
array is an arbitrary-length UTF-8 encoded string. This will enable
performance, memory usage, and usability improvements for NumPy users,
including:

* Memory savings for workflows that currently use fixed-width strings and
  store primarily ASCII data or a mix of short and long strings in a single
  NumPy array.

* Downstream libraries and users will be able to move away from object
  arrays currently used as a substitute for variable-length string arrays,
  unlocking performance improvements by avoiding passes over the data
  outside of NumPy.

* A more intuitive user-facing API for working with arrays of Python
  strings, without a need to think about the in-memory array representation.

Motivation and Scope
--------------------

First, we will describe how the current state of support for string or
string-like data in NumPy arose. Next, we will summarize the last major
discussion about this topic. Finally, we will describe the scope of the
proposed changes to NumPy as well as changes that are explicitly out of
scope of this proposal.

History of String Support in NumPy
**********************************

Support in NumPy for textual data evolved organically in response to early
user needs and then changes in the Python ecosystem. Support for strings was
added to NumPy to support users of the NumArray ``chararray`` type.
Remnants of this are still visible in the NumPy API: string-related
functionality lives in ``np.char``, to support the obsolete
``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
DTypes.

NumPy's ``bytes_`` DType was originally used to represent the Python 2
``str`` type before Python 3 support was added to NumPy. The bytes DType
makes the most sense when it is used to represent Python 2 strings or other
null-terminated byte sequences. However, ignoring data after the first null
character means the ``bytes_`` DType is only suitable for bytestreams that
do not contain nulls, so it is a poor match for generic bytestreams.

The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It
stores data in 32-bit UCS-4 codepoints (e.g. a UTF-32 encoding), which makes
for a straightforward implementation, but is inefficient for storing text
that can be represented well using a one-byte ASCII or Latin-1 encoding.
This was not a problem in Python 2, where ASCII or mostly-ASCII text could
use the Python 2 ``str`` DType (the current ``bytes_`` DType).

With the arrival of Python 3 support in NumPy, the string DTypes were
largely left alone due to backward compatibility concerns, although the
unicode DType became the default DType for ``str`` data and the old
``string`` DType was renamed the ``bytes_`` DType. This change left NumPy
with the sub-optimal situation of shipping a data type originally intended
for null-terminated bytestrings as the data type for *all* Python ``bytes``
data, and a default string type with an in-memory representation that
consumes four times as much memory as needed for ASCII or mostly-ASCII data.

Problems with Fixed-Width Strings
*********************************

Both existing string DTypes represent fixed-width sequences, allowing
storage of the string data in the array buffer. This avoids adding
out-of-band storage to NumPy; however, it makes for an awkward user
interface.
In particular, the maximum string size must be inferred by NumPy or
estimated by the user before loading the data into a NumPy array or
selecting an output DType for string operations. In the worst case, this
requires an expensive pass over the full dataset to calculate the maximum
length of an array element. It also wastes memory when array elements have
varying lengths. Pathological cases where an array stores many short strings
and a few very long strings are particularly bad for wasting memory.

Downstream usage of string data in NumPy arrays has demonstrated the need
for a variable-width string data type. In practice, most downstream users
employ ``object`` arrays for this purpose. In particular, ``pandas`` has
explicitly deprecated support for NumPy fixed-width strings, coerces NumPy
fixed-width string arrays to ``object`` arrays, and in the future may switch
to only supporting string data via ``PyArrow``, which has native support for
UTF-8 encoded variable-width string arrays [1]_. This is unfortunate, since
``object`` arrays have no type guarantees, necessitating expensive
sanitization passes, and operations on object arrays cannot release the GIL.

Previous Discussions
--------------------

The project last discussed this topic in depth in 2017, when Julian Taylor
proposed a fixed-width text data type parameterized by an encoding [2]_.
This started a wide-ranging discussion about pain points for working with
string data in NumPy and possible ways forward. In the end, the discussion
identified two use-cases that the current support for strings does a poor
job of handling:

* Loading or memory-mapping scientific datasets with unknown encoding,
* Working with string data in a manner that allows transparent conversion
  between NumPy arrays and Python strings, including support for missing
  strings.
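As a concrete illustration of the memory waste from fixed-width strings
described earlier, the following short example uses only existing NumPy
behavior (no new API is assumed)::

```python
import numpy as np

# Fixed-width unicode arrays are sized for the longest element:
arr = np.array(["numpy", "a" * 100])

print(arr.dtype)     # <U100: every element reserves 100 code points
print(arr.itemsize)  # 400: 4 bytes per UCS-4 code point times the max width

# The 5-character "numpy" still occupies 400 bytes in the array buffer,
# so a single long string inflates the storage of every short string.
```

This is exactly the pathological case noted above: many short strings plus a
few long ones force every element to the size of the longest.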
As a result of this discussion, improving support for string data was added
to the NumPy project roadmap [3]_, with an explicit call-out to add a DType
better suited to memory-mapping bytes with any or no encoding, and a
variable-width string DType that supports missing data to replace usages of
object string arrays.

Proposed work
-------------

This NEP proposes adding ``StringDType``, a DType that stores variable-width
heap-allocated strings in NumPy arrays, to replace downstream usages of the
``object`` DType for string data. This work will heavily leverage recent
improvements in NumPy to improve support for user-defined DTypes, so we will
also necessarily be working on the data type internals in NumPy. In
particular, we propose to:

* Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
* Work out issues related to adding a DType implemented using the
  experimental DType API to NumPy itself.
* Add support for a user-provided missing data sentinel.
* Clean up ``np.char``, moving the ufunc-like functions to a new namespace
  for functions and types related to string support.
* Update the ``npy`` and ``npz`` file formats to allow storage of
  arbitrary-length sidecar data.

The following is out of scope for this work:

* Changing DType inference for string data.
* Adding a DType for memory-mapping text in unknown encodings or a DType
  that attempts to fix issues with the ``bytes_`` DType.
* Fully agreeing on the semantics of missing data sentinels or adding a
  missing data sentinel to NumPy itself.
* Implementing fast ufuncs or SIMD optimizations for string operations.

While we're explicitly ruling out implementing these items as part of this
work, adding a new string DType helps set up future work that does implement
some of these items. If implemented, this NEP will make it easier to add a
new fixed-width text DType in the future by moving string operations into a
long-term supported namespace.
We are also proposing a memory layout that should be amenable to writing
fast ufuncs and SIMD optimization in some cases, increasing the payoff for
writing string operations as SIMD-optimized ufuncs in the future.

While we are not proposing adding a missing data sentinel to NumPy, we are
proposing adding support for an optional, user-provided missing data
sentinel, so this does move NumPy a little closer to officially supporting
missing data. We are attempting to avoid resolving the disagreement
described in :ref:`NEP 26<NEP26>`, and this proposal does not require or
preclude adding a missing data sentinel or bitflag-based missing data
support in the future.

Usage and Impact
----------------

The DType is intended as a drop-in replacement for object string arrays.
This means that we intend to support as many downstream usages of object
string arrays as possible, including all supported NumPy functionality.
Pandas is the obvious first user, and substantial work has already occurred
to add support in a fork of Pandas. ``scikit-learn`` also uses object string
arrays and will be able to migrate to a DType with guarantees that the
arrays contain only strings. Both h5py [4]_ and PyTables [5]_ will be able
to add first-class support for variable-width UTF-8 encoded string datasets
in HDF5. String data are heavily used in machine-learning workflows and
downstream machine learning libraries will be able to leverage this new
DType.

Users who wish to load string data into NumPy and leverage NumPy features
like advanced indexing will have a natural choice that offers substantial
memory savings over fixed-width unicode strings and better validation
guarantees and overall integration with NumPy than object string arrays.
Moving to a first-class string DType also removes the need to acquire the
GIL during string operations, unlocking future optimizations that are
impossible with object string arrays.
Performance
***********

Here we briefly describe preliminary performance measurements of the
prototype version of ``StringDType`` we have implemented outside of NumPy
using the experimental DType API. All benchmarks in this section were
performed on a Dell XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3
compiled using pyenv. NumPy, Pandas, and the ``StringDType`` prototype were
all compiled with meson release builds.

Currently, the ``StringDType`` prototype has comparable performance with
object arrays and fixed-width string arrays. One exception is array creation
from Python strings, where performance is somewhat slower than object arrays
and comparable to fixed-width unicode arrays::

  In [1]: from stringdtype import StringDType

  In [2]: import numpy as np

  In [3]: data = [str(i) * 10 for i in range(100_000)]

  In [4]: %timeit arr_object = np.array(data, dtype=object)
  3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
  12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
  11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this example, object DTypes are substantially faster because the objects
in the ``data`` list can be directly interned in the array, while the
fixed-width unicode DType and ``StringDType`` need to copy the string data,
and ``StringDType`` additionally needs to convert the data to UTF-8 and
perform heap allocations outside the array buffer. In the future, if Python
moves to a UTF-8 internal representation for strings, the string loading
performance of ``StringDType`` should improve.

String operations have similar performance::

  In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
  30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

  In [8]: %timeit np.char.capitalize(arr_stringdtype)
  38.5 ms ± 3.01 ms per loop (mean ± std. dev.
  of 7 runs, 10 loops each)

  In [9]: %timeit np.char.capitalize(arr_strdtype)
  46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The poor performance here is a reflection of the slow iterator-based
implementation of operations in ``np.char``. If we were to rewrite these
operations as ufuncs, we could unlock substantial performance improvements.
Using the example of the ``add`` ufunc, which we have implemented for the
``StringDType`` prototype::

  In [10]: %timeit arr_object + arr_object
  10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [11]: %timeit arr_stringdtype + arr_stringdtype
  5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
  65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As described below, we have already updated a fork of Pandas to use a
prototype version of ``StringDType``. This demonstrates the performance
improvements available when data are already loaded into a NumPy array and
are passed to a third-party library. Currently Pandas attempts to coerce all
``str`` data to ``object`` DType by default, and has to check and sanitize
existing ``object`` arrays that are passed in. This requires a copy of or a
pass over the data, which is made unnecessary by first-class support for
variable-width strings in both NumPy and Pandas::

  In [13]: import pandas as pd

  In [14]: %timeit pd.Series(arr_stringdtype)
  20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

  In [15]: %timeit pd.Series(arr_object)
  1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We have also implemented a Pandas extension DType that uses ``StringDType``
under the hood, which is also substantially faster for creating Pandas data
structures than the existing Pandas string DType that uses ``object``
arrays::

  In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
  54.7 µs ± 1.38 µs per loop (mean ± std. dev.
  of 7 runs, 10,000 loops each)

  In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
  1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Backward compatibility
----------------------

We are not proposing a change to DType inference for Python strings and do
not expect to see any impacts on existing usages of NumPy, besides warnings
or errors related to new deprecations or expiring deprecations in
``np.char``.