The NEP was merged in draft form; see below.

https://numpy.org/neps/nep-0055-string_dtype.html

On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hello all,

I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483.

Per NEP 0, I've copied everything up to the "detailed description" section below.

I'm looking forward to your feedback on this.

-Nathan Goldbaum

=========================================================
NEP 55: Add a UTF-8 Variable-Width String DType to NumPy
=========================================================

:Author: Nathan Goldbaum <ngoldbaum@quansight.com>
:Status: Draft
:Type: Standards Track
:Created: 2023-06-29


Abstract
--------

We propose adding a new string data type to NumPy where each item in the array
is an arbitrary length UTF-8 encoded string. This will enable performance,
memory usage, and usability improvements for NumPy users, including:

* Memory savings for workflows that currently use fixed-width strings and store
  primarily ASCII data or a mix of short and long strings in a single NumPy
  array.

* Downstream libraries and users will be able to move away from object arrays
  currently used as a substitute for variable-length string arrays, unlocking
  performance improvements by avoiding passes over the data outside of NumPy.

* A more intuitive user-facing API for working with arrays of Python strings,
  without a need to think about the in-memory array representation.

Motivation and Scope
--------------------

First, we will describe how the current state of support for string or
string-like data in NumPy arose. Next, we will summarize the last major previous
discussion about this topic. Finally, we will describe the scope of the proposed
changes to NumPy as well as changes that are explicitly out of scope of this
proposal.

History of String Support in NumPy
**********************************

Support in NumPy for textual data evolved organically, first in response to
early user needs and later to changes in the Python ecosystem.

Support for strings was added to NumPy to accommodate users of the NumArray
``chararray`` type. Remnants of this are still visible in the NumPy API:
string-related functionality lives in ``np.char``, which exists to support the
obsolete ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of
string DTypes.

NumPy's ``bytes_`` DType was originally used to represent the Python 2 ``str``
type before Python 3 support was added to NumPy. The bytes DType makes the most
sense when it is used to represent Python 2 strings or other null-terminated
byte sequences. However, because trailing null bytes are treated as padding and
stripped on retrieval, the ``bytes_`` DType cannot faithfully round-trip
arbitrary byte sequences, making it a poor match for generic bytestreams.
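As a brief illustration of why this is lossy, a minimal sketch using current
NumPy behavior (variable names are illustrative):

```python
import numpy as np

# Trailing null bytes are treated as padding by the fixed-width "S" DType.
arr = np.array([b"abc\x00\x00"], dtype="S5")

# The stored item is 5 bytes wide, but retrieval strips the trailing nulls,
# so the original bytestream cannot be round-tripped.
assert arr.itemsize == 5
assert arr[0] == b"abc"
```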

The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It
stores data as 32-bit UCS-4 codepoints (i.e., a UTF-32 encoding), which makes
for a straightforward implementation, but is inefficient for storing text that
could be represented well using a one-byte ASCII or Latin-1 encoding. This was
not a problem in Python 2, where ASCII or mostly-ASCII text could use the DType
for Python 2 ``str`` data (the current ``bytes_`` DType).
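The size difference is easy to quantify with plain Python; this small sketch
uses the standard codecs and involves no NumPy at all:

```python
# ASCII text costs one byte per character in UTF-8 but four in UCS-4/UTF-32:
text = "hello"
utf8_size = len(text.encode("utf-8"))
ucs4_size = len(text.encode("utf-32-le"))  # "-le" avoids the 4-byte BOM

assert utf8_size == 5
assert ucs4_size == 20  # four times larger for pure-ASCII data
```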

With the arrival of Python 3 support in NumPy, the string DTypes were largely
left alone due to backward compatibility concerns, although the unicode DType
became the default DType for ``str`` data and the old ``string`` DType was
renamed to the ``bytes_`` DType. This change left NumPy in the sub-optimal
situation of shipping a data type originally intended for null-terminated
bytestrings as the data type for *all* Python ``bytes`` data, and a default
string type with an in-memory representation that consumes four times as much
memory as needed for ASCII or mostly-ASCII data.
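The four-fold overhead is visible directly in the default DType NumPy chooses
for Python strings (a minimal sketch):

```python
import numpy as np

# The default string DType stores 4 bytes per character (UCS-4),
# regardless of whether the text is plain ASCII.
arr = np.array(["hello"])
assert arr.dtype.kind == "U"   # fixed-width unicode DType
assert arr.itemsize == 20      # 5 characters * 4 bytes each
```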

Problems with Fixed-Width Strings
*********************************

Both existing string DTypes represent fixed-width sequences, allowing storage of
the string data in the array buffer. This avoids adding out-of-band storage to
NumPy; however, it makes for an awkward user interface. In particular, the
maximum string size must be inferred by NumPy or estimated by the user before
loading the data into a NumPy array or selecting an output DType for string
operations. In the worst case, this requires an expensive pass over the full
dataset to calculate the maximum length of an array element. It also wastes
memory when array elements have varying lengths. Pathological cases where an
array stores many short strings and a few very long strings are particularly bad
for wasting memory.
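A minimal sketch of the pathological case: a single long element forces every
element to be allocated at the maximum width:

```python
import numpy as np

# Three one-character strings plus one 1000-character string:
data = ["a", "b", "c", "x" * 1000]
arr = np.array(data)

# Every element is allocated the full maximum width.
assert arr.dtype.itemsize == 4000   # 1000 characters * 4 bytes each
assert arr.nbytes == 16000          # ~12 KB of padding for the short strings
```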

Downstream usage of string data in NumPy arrays has demonstrated the need for a
variable-width string data type. In practice, most downstream users employ
``object`` arrays for this purpose. In particular, ``pandas`` has explicitly
deprecated support for NumPy fixed-width strings, coerces NumPy fixed-width
string arrays to ``object`` arrays, and in the future may switch to only
supporting string data via ``PyArrow``, which has native support for UTF-8
encoded variable-width string arrays [1]_. This is unfortunate: ``object``
arrays have no type guarantees, necessitating expensive sanitization passes,
and operations on object arrays cannot release the GIL.
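To illustrate the lack of type guarantees, nothing prevents non-string objects
from entering an ``object`` array that nominally holds strings (a minimal
sketch):

```python
import numpy as np

# An "object" array nominally holding strings...
arr = np.array(["spam", "eggs"], dtype=object)

# ...silently accepts arbitrary Python objects, so consumers must
# sanitize the contents before treating the array as string data.
arr[0] = 123
assert not all(isinstance(item, str) for item in arr)
```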

Previous Discussions
--------------------

The project last discussed this topic in depth in 2017, when Julian Taylor
proposed a fixed-width text data type parameterized by an encoding [2]_. This
started a wide-ranging discussion about pain points for working with string data
in NumPy and possible ways forward.

In the end, the discussion identified two use-cases that the current support for
strings does a poor job of handling:

* Loading or memory-mapping scientific datasets with unknown encoding,
* Working with string data in a manner that allows transparent conversion
  between NumPy arrays and Python strings, including support for missing
  strings.

As a result of this discussion, improving support for string data was added to
the NumPy project roadmap [3]_, with an explicit call-out to add a DType better
suited to memory-mapping bytes with any or no encoding, and a variable-width
string DType that supports missing data to replace usages of object string
arrays.

Proposed work
-------------

This NEP proposes adding ``StringDType``, a DType that stores variable-width
heap-allocated strings in NumPy arrays, to replace downstream usages of the
``object`` DType for string data. This work will heavily leverage recent
improvements in NumPy to improve support for user-defined DTypes, so we will
also necessarily be working on the data type internals in NumPy. In particular,
we propose to:

* Add a new variable-length string DType to NumPy, targeting NumPy 2.0.

* Work out issues related to adding a DType implemented using the experimental
  DType API to NumPy itself.

* Add support for a user-provided missing data sentinel.

* Clean up ``np.char``, moving the ufunc-like functions to a new namespace for
  functions and types related to string support.

* Update the ``npy`` and ``npz`` file formats to allow storage of
  arbitrary-length sidecar data.

The following is out of scope for this work:

* Changing DType inference for string data.

* Adding a DType for memory-mapping text in unknown encodings or a DType that
  attempts to fix issues with the ``bytes_`` DType.

* Fully agreeing on the semantics of a missing data sentinel or adding a
  missing data sentinel to NumPy itself.

* Implementing fast ufuncs or SIMD optimizations for string operations.

While we're explicitly ruling out implementing these items as part of this work,
adding a new string DType helps set up future work that does implement some of
these items.

If implemented, this NEP will make it easier to add a new fixed-width text DType
in the future by moving string operations into a long-term supported
namespace. We are also proposing a memory layout that should be amenable to
writing fast ufuncs and SIMD optimization in some cases, increasing the payoff
for writing string operations as SIMD-optimized ufuncs in the future.

While we are not proposing adding a missing data sentinel to NumPy, we are
proposing adding support for an optional, user-provided missing data sentinel,
so this does move NumPy a little closer to officially supporting missing
data. We are attempting to avoid resolving the disagreement described in
:ref:`NEP 26<NEP26>` and this proposal does not require or preclude adding a
missing data sentinel or bitflag-based missing data support in the future.

Usage and Impact
----------------

The DType is intended as a drop-in replacement for object string arrays. This
means that we intend to support as many downstream usages of object string
arrays as possible, including all supported NumPy functionality. Pandas is the
obvious first user, and substantial work has already occurred to add support in
a fork of Pandas. ``scikit-learn`` also uses object string arrays and will be
able to migrate to a DType that guarantees the arrays contain only
strings. Both h5py [4]_ and PyTables [5]_ will be able to add first-class
support for variable-width UTF-8 encoded string datasets in HDF5. String data
are heavily used in machine-learning workflows and downstream machine learning
libraries will be able to leverage this new DType.

Users who wish to load string data into NumPy and leverage NumPy features like
advanced indexing will have a natural choice that offers substantial memory
savings over fixed-width unicode strings, along with better validation
guarantees and tighter integration with NumPy than object string arrays. Moving to a
first-class string DType also removes the need to acquire the GIL during string
operations, unlocking future optimizations that are impossible with object
string arrays.

Performance
***********

Here we briefly describe preliminary performance measurements of the prototype
version of ``StringDType`` we have implemented outside of NumPy using the
experimental DType API. All benchmarks in this section were performed on a Dell
XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy,
Pandas, and the ``StringDType`` prototype were all compiled with meson release
builds.

Currently, the ``StringDType`` prototype has performance comparable to object
arrays and fixed-width string arrays. One exception is array creation from
Python strings, where performance is somewhat slower than with object arrays
and comparable to fixed-width unicode arrays::

    In [1]: from stringdtype import StringDType

    In [2]: import numpy as np

    In [3]: data = [str(i) * 10 for i in range(100_000)]

    In [4]: %timeit arr_object = np.array(data, dtype=object)
    3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
    12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
    11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this example, the ``object`` DType is substantially faster because the
objects in the ``data`` list can be stored directly in the array by reference,
while the fixed-width unicode DType and ``StringDType`` must copy the string
data; ``StringDType`` additionally needs to convert the data to UTF-8 and
perform heap allocations outside the array buffer. In the future, if Python
moves to a UTF-8 internal representation for strings, the string loading
performance of ``StringDType`` should improve.
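The reference-storing behavior of ``object`` arrays can be seen directly (a
minimal sketch):

```python
import numpy as np

data = ["some moderately long string"]
arr = np.array(data, dtype=object)

# The array element is the very same str object that was in the list:
# no string data is copied at array-creation time.
assert arr[0] is data[0]
```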

String operations have similar performance::

    In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
    30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [8]: %timeit np.char.capitalize(arr_stringdtype)
    38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [9]: %timeit np.char.capitalize(arr_strdtype)
    46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The poor performance here is a reflection of the slow iterator-based
implementation of operations in ``np.char``. If we were to rewrite these
operations as ufuncs, we could unlock substantial performance
improvements. Consider the ``add`` ufunc, which we have implemented for the
``StringDType`` prototype::

    In [10]: %timeit arr_object + arr_object
    10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [11]: %timeit arr_stringdtype + arr_stringdtype
    5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
    65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As described below, we have already updated a fork of Pandas to use a prototype
version of ``StringDType``. This demonstrates the performance improvements
available when data are already loaded into a NumPy array and are passed to a
third-party library. Currently Pandas attempts to coerce all ``str`` data to
the ``object`` DType by default, and has to check and sanitize existing
``object`` arrays that are passed in. This requires a copy of or a pass over
the data, both of which are made unnecessary by first-class support for
variable-width strings in both NumPy and Pandas::

    In [13]: import pandas as pd

    In [14]: %timeit pd.Series(arr_stringdtype)
    20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [15]: %timeit pd.Series(arr_object)
    1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We have also implemented a Pandas extension DType that uses ``StringDType``
under the hood, which is also substantially faster for creating Pandas data
structures than the existing Pandas string DType that uses ``object`` arrays::

    In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
    54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
    1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Backward compatibility
----------------------

We are not proposing a change to DType inference for Python strings and do not
expect to see any impacts on existing usages of NumPy, besides warnings or
errors related to new deprecations or expiring deprecations in ``np.char``.