Hello all,
Per NEP 0, I've copied everything up to the "detailed description" section below.
I'm looking forward to your feedback on this.
-Nathan Goldbaum

==========================================================
NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
==========================================================

Abstract
--------

We propose adding a new string data type to NumPy where each item in the array
is an arbitrary length UTF-8 encoded string. This will enable performance,
memory usage, and usability improvements for NumPy users, including:

* Memory savings for workflows that currently use fixed-width strings and store
  primarily ASCII data or a mix of short and long strings in a single NumPy
  array.
* Downstream libraries and users will be able to move away from object arrays
  currently used as a substitute for variable-length string arrays, unlocking
  performance improvements by avoiding passes over the data outside of NumPy.
* A more intuitive user-facing API for working with arrays of Python strings,
  without a need to think about the in-memory array representation.

Motivation and Scope
--------------------

First, we will describe how the current state of support for string or
string-like data in NumPy arose. Next, we will summarize the last major
discussion about this topic. Finally, we will describe the scope of the
proposed changes to NumPy as well as changes that are explicitly out of scope
of this proposal.

History of String Support in NumPy
**********************************

Support in NumPy for textual data evolved organically in response to early user
needs and then changes in the Python ecosystem.

Support for strings was added to NumPy to accommodate users of the NumArray
``chararray`` type. Remnants of this are still visible in the NumPy API:
string-related functionality lives in ``np.char``, to support the obsolete
``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
DTypes.

NumPy's ``bytes_`` DType was originally used to represent the Python 2 ``str``
type before Python 3 support was added to NumPy. The bytes DType makes the most
sense when it is used to represent Python 2 strings or other null-terminated
byte sequences. However, ignoring data after the first null character means the
``bytes_`` DType is only suitable for bytestreams that do not contain nulls, so
it is a poor match for generic bytestreams.
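
For example, the nulls used to pad fixed-width elements are stripped when an
element is read back, so bytestreams ending in null bytes do not round-trip::

    import numpy as np

    # Fixed-width bytes elements are null-padded, and trailing nulls are
    # stripped when an element is converted back to Python bytes, so
    # bytestreams that end in null bytes cannot be stored faithfully.
    arr = np.array([b"abc\x00\x00"], dtype="S5")
    print(arr[0])  # b'abc' -- the trailing nulls are lost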

The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It
stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which makes
for a straightforward implementation, but is inefficient for storing text that
can be represented well using a one-byte ASCII or Latin-1 encoding. This was
not a problem in Python 2, where ASCII or mostly-ASCII text could use the
Python 2 ``str`` DType (the current ``bytes_`` DType).
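
The overhead is easy to see from the itemsize of a unicode array holding
pure-ASCII text::

    import numpy as np

    # Each element reserves four bytes per character, even for ASCII data.
    arr = np.array(["hello"])
    print(arr.dtype)           # <U5
    print(arr.dtype.itemsize)  # 20 -- four bytes for each of five characters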

With the arrival of Python 3 support in NumPy, the string DTypes were largely
left alone due to backward compatibility concerns, although the unicode DType
became the default DType for ``str`` data and the old ``string`` DType was
renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal
situation of shipping a data type originally intended for null-terminated
bytestrings as the data type for *all* Python ``bytes`` data, and a default
string type with an in-memory representation that consumes four times as much
memory as needed for ASCII or mostly-ASCII data.
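
Both defaults are visible when creating arrays from Python objects::

    import numpy as np

    # Python bytes map to the null-terminated fixed-width bytes_ DType...
    print(np.array([b"hello"]).dtype)  # |S5
    # ...and Python str maps to the four-bytes-per-character unicode DType.
    print(np.array(["hello"]).dtype)   # <U5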

Problems with Fixed-Width Strings
*********************************

Both existing string DTypes represent fixed-width sequences, allowing storage
of the string data in the array buffer. This avoids adding out-of-band storage
to NumPy; however, it makes for an awkward user interface. In particular, the
maximum string size must be inferred by NumPy or estimated by the user before
loading the data into a NumPy array or selecting an output DType for string
operations. In the worst case, this requires an expensive pass over the full
dataset to calculate the maximum length of an array element. It also wastes
memory when array elements have varying lengths. Pathological cases where an
array stores many short strings and a few very long strings are particularly
bad for memory usage.
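
For example, a single long element forces every element in the array to
reserve the worst-case width::

    import numpy as np

    # The inferred dtype is wide enough for the longest string, so each of
    # the three elements reserves 1000 characters (4000 bytes).
    arr = np.array(["a", "b", "c" * 1000])
    print(arr.dtype)   # <U1000
    print(arr.nbytes)  # 12000, almost all of it padding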

Downstream usage of string data in NumPy arrays has demonstrated the need for a
variable-width string data type. In practice, most downstream users employ
``object`` arrays for this purpose. In particular, ``pandas`` has explicitly
deprecated support for NumPy fixed-width strings, coerces NumPy fixed-width
string arrays to ``object`` arrays, and in the future may switch to only
supporting string data via ``PyArrow``, which has native support for UTF-8
encoded variable-width string arrays [1]_. This is unfortunate, since
``object`` arrays have no type guarantees, necessitating expensive sanitization
passes, and operations on ``object`` arrays cannot release the GIL.
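
For example, nothing prevents non-string data from entering an ``object``
array that is nominally used for strings::

    import numpy as np

    # An "array of strings" that silently accepts non-string data, so
    # downstream consumers must validate every element before use.
    arr = np.array(["hello", "world"], dtype=object)
    arr[0] = 12345  # accepted without complaint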

The project last discussed this topic in depth in 2017, when Julian Taylor
proposed a fixed-width text data type parameterized by an encoding [2]_. This
started a wide-ranging discussion about pain points for working with string
data in NumPy and possible ways forward.

In the end, the discussion identified two use-cases that the current support
for strings does a poor job of handling:

* Loading or memory-mapping scientific datasets with unknown encoding.
* Working with string data in a manner that allows transparent conversion
  between NumPy arrays and Python strings, including support for missing data.

As a result of this discussion, improving support for string data was added to
the NumPy project roadmap [3]_, with an explicit call-out to add a DType better
suited to memory-mapping bytes with any or no encoding, and a variable-width
string DType that supports missing data to replace usages of object string
arrays.

This NEP proposes adding ``StringDType``, a DType that stores variable-width
heap-allocated strings in NumPy arrays, to replace downstream usages of the
``object`` DType for string data. This work will heavily leverage recent
improvements in NumPy to improve support for user-defined DTypes, so we will
also necessarily be working on the data type internals in NumPy. In particular,
we propose to:

* Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
* Work out issues related to adding a DType implemented using the experimental
  DType API to NumPy itself.
* Support a user-provided missing data sentinel.
* Clean up ``np.char``, moving the ufunc-like functions to a new namespace for
  functions and types related to string support.
* Update the ``npy`` and ``npz`` file formats to allow storage of
  arbitrary-length sidecar data.

The following is out of scope for this work:

* Changing DType inference for string data.
* Adding a DType for memory-mapping text in unknown encodings or a DType that
  attempts to fix issues with the ``bytes_`` DType.
* Fully agreeing on the semantics of missing data sentinels or adding a missing
  data sentinel to NumPy itself.
* Implementing fast ufuncs or SIMD optimizations for string operations.

While we are explicitly ruling out implementing these items as part of this
work, adding a new string DType helps set up future work that does implement
some of them.

If implemented, this NEP will make it easier to add a new fixed-width text
DType in the future by moving string operations into a long-term supported
namespace. We are also proposing a memory layout that should be amenable to
writing fast ufuncs and SIMD optimization in some cases, increasing the payoff
for writing string operations as SIMD-optimized ufuncs in the future.

While we are not proposing adding a missing data sentinel to NumPy, we are
proposing adding support for an optional, user-provided missing data sentinel,
so this does move NumPy a little closer to officially supporting missing data.
We are attempting to avoid resolving the disagreement described in
:ref:`NEP 26<NEP26>`, and this proposal does not require or preclude adding a
missing data sentinel or bitflag-based missing data support in the future.
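
As a sketch of what this could look like, based on the ``na_object`` keyword
accepted by the prototype (the final spelling of the API may differ)::

    import numpy as np
    from stringdtype import StringDType  # the prototype package

    # Opt in to missing-data support by supplying a sentinel object;
    # NumPy itself does not define what the sentinel means.
    dt = StringDType(na_object=None)
    arr = np.array(["hello", None, "world"], dtype=dt)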

Usage and Impact
----------------

The DType is intended as a drop-in replacement for object string arrays. This
means that we intend to support as many downstream usages of object string
arrays as possible, including all supported NumPy functionality. Pandas is the
obvious first user, and substantial work has already occurred to add support in
a fork of Pandas. ``scikit-learn`` also uses object string arrays and will be
able to migrate to a DType with guarantees that the arrays contain only
strings. Both h5py [4]_ and PyTables [5]_ will be able to add first-class
support for variable-width UTF-8 encoded string datasets in HDF5. String data
are heavily used in machine learning workflows, and downstream machine learning
libraries will be able to leverage this new DType.

Users who wish to load string data into NumPy and leverage NumPy features like
advanced indexing will have a natural choice that offers substantial memory
savings over fixed-width unicode strings and better validation guarantees and
overall integration with NumPy than object string arrays. Moving to a
first-class string DType also removes the need to acquire the GIL during string
operations, unlocking future optimizations that are impossible with object
string arrays.

Performance
***********

Here we briefly describe preliminary performance measurements of the prototype
version of ``StringDType`` we have implemented outside of NumPy using the
experimental DType API. All benchmarks in this section were performed on a Dell
XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy,
Pandas, and the ``StringDType`` prototype were all compiled with meson release
builds.

Currently, the ``StringDType`` prototype has performance comparable to object
arrays and fixed-width string arrays. One exception is array creation from
Python strings, where performance is somewhat slower than object arrays and
comparable to fixed-width unicode arrays::

    In [1]: from stringdtype import StringDType

    In [2]: import numpy as np

    In [3]: data = [str(i) * 10 for i in range(100_000)]

    In [4]: %timeit arr_object = np.array(data, dtype=object)
    3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
    12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
    11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this example, the object DType is substantially faster because the objects
in the ``data`` list can be directly interned in the array, while the ``str``
DType and ``StringDType`` need to copy the string data, and ``StringDType``
additionally needs to convert the data to UTF-8 and perform heap allocations
outside the array buffer. In the future, if Python moves to a UTF-8 internal
representation for strings, the string loading performance of ``StringDType``
should improve.

String operations have similar performance::

    In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
    30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [8]: %timeit np.char.capitalize(arr_stringdtype)
    38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [9]: %timeit np.char.capitalize(arr_strdtype)
    46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The poor performance here is a reflection of the slow iterator-based
implementation of operations in ``np.char``. If we were to rewrite these
operations as ufuncs, we could unlock substantial performance improvements, as
in this example of the ``add`` ufunc, which we have implemented for the
``StringDType`` prototype::

    In [10]: %timeit arr_object + arr_object
    10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [11]: %timeit arr_stringdtype + arr_stringdtype
    5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
    65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As described below, we have already updated a fork of Pandas to use a prototype
version of ``StringDType``. This demonstrates the performance improvements
available when data are already loaded into a NumPy array and are passed to a
third-party library. Currently, Pandas attempts to coerce all ``str`` data to
``object`` DType by default, and has to check and sanitize existing ``object``
arrays that are passed in. This requires a copy or pass over the data that is
made unnecessary by first-class support for variable-width strings in both
NumPy and Pandas::

    In [13]: import pandas as pd

    In [14]: %timeit pd.Series(arr_stringdtype)
    20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [15]: %timeit pd.Series(arr_object)
    1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We have also implemented a Pandas extension DType that uses ``StringDType``
under the hood, which is also substantially faster for creating Pandas data
structures than the existing Pandas string DType that uses ``object`` arrays::

    In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
    54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
    1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Backward Compatibility
----------------------

We are not proposing a change to DType inference for Python strings and do not
expect to see any impacts on existing usages of NumPy, besides warnings or
errors related to new deprecations or expiring deprecations in ``np.char``.