On Tue, Aug 29, 2023 at 4:08 PM Nathan <nathan.goldbaum@gmail.com> wrote:

The NEP was merged in draft form, see below.

https://numpy.org/neps/nep-0055-string_dtype.html

This is a really nice NEP, thanks Nathan! I see that questions and constructive feedback are still coming in on GitHub, but for now it seems like everyone is pretty happy with moving forward with implementing this new dtype in NumPy.

Cheers,

Ralf

On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Hello all,

I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483.

Per NEP 0, I've copied everything up to the "detailed description" section below.

I'm looking forward to your feedback on this.

-Nathan Goldbaum

=========================================================
NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
=========================================================

:Author: Nathan Goldbaum <ngoldbaum@quansight.com>
:Status: Draft
:Type: Standards Track
:Created: 2023-06-29

Abstract
--------

We propose adding a new string data type to NumPy where each item in the array
is an arbitrary length UTF-8 encoded string. This will enable performance,
memory usage, and usability improvements for NumPy users, including:

* Memory savings for workflows that currently use fixed-width strings and store
  primarily ASCII data or a mix of short and long strings in a single NumPy
  array.

* Downstream libraries and users will be able to move away from object arrays
  currently used as a substitute for variable-length string arrays, unlocking
  performance improvements by avoiding passes over the data outside of NumPy.

* A more intuitive user-facing API for working with arrays of Python strings,
  without a need to think about the in-memory array representation.

Motivation and Scope
--------------------

First, we will describe how the current state of support for string or
string-like data in NumPy arose. Next, we will summarize the last major previous
discussion about this topic. Finally, we will describe the scope of the proposed
changes to NumPy as well as changes that are explicitly out of scope of this
proposal.

History of String Support in NumPy
**********************************

Support in NumPy for textual data evolved organically in response to early user
needs and then changes in the Python ecosystem.

Support for strings was added to NumPy to support users of the NumArray
``chararray`` type. Remnants of this are still visible in the NumPy API:
string-related functionality lives in ``np.char``, to support the obsolete
``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
DTypes.

NumPy's ``bytes_`` DType was originally used to represent the Python 2 ``str``
type before Python 3 support was added to NumPy. The bytes DType makes the most
sense when it is used to represent Python 2 strings or other null-terminated
byte sequences. However, because trailing null characters are ignored, the
``bytes_`` DType is only suitable for bytestreams that do not end in nulls, so
it is a poor match for generic bytestreams.
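The null-handling issue can be seen directly with today's fixed-width
``bytes_`` DType. The following is a minimal sketch, assuming only that NumPy is
installed; the variable names are illustrative. It shows that trailing null
bytes are dropped when an element is read back, so two different inputs become
indistinguishable:

```python
import numpy as np

# Two different byte sequences: one ends in explicit null bytes.
with_nulls = np.array([b"ab\x00\x00"], dtype="S4")
without_nulls = np.array([b"ab"], dtype="S4")

# The underlying buffers are identical, because short values are
# null-padded up to the fixed width.
same_buffer = with_nulls.tobytes() == without_nulls.tobytes()

# Reading an element back strips the trailing nulls, so the original
# four-byte value cannot be recovered.
round_tripped = with_nulls[0]
```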

The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It
stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which makes for
a straightforward implementation, but is inefficient for storing text that can
be represented well using a one-byte ASCII or Latin-1 encoding. This was not a
problem in Python 2, where ASCII or mostly-ASCII text could use the Python 2
``str`` DType (the current ``bytes_`` DType).

With the arrival of Python 3 support in NumPy, the string DTypes were largely
left alone due to backward compatibility concerns, although the unicode DType
became the default DType for ``str`` data and the old ``string`` DType was
renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal
situation of shipping a data type originally intended for null-terminated
bytestrings as the data type for *all* Python ``bytes`` data, and a default
string type with an in-memory representation that consumes four times as much
memory as needed for ASCII or mostly-ASCII data.
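The factor of four can be checked with Python's built-in codecs alone. This
sketch does not use NumPy; it simply compares the encoded size of an
illustrative ASCII string under UTF-32 (the per-character layout used by the
unicode DType) and UTF-8:

```python
text = "temperature_sensor_01"  # illustrative ASCII-only sample

# "utf-32-le" avoids the 4-byte byte-order mark that plain "utf-32" prepends.
utf32_size = len(text.encode("utf-32-le"))
utf8_size = len(text.encode("utf-8"))

# For pure ASCII text, UTF-8 needs one byte per character and UTF-32 four.
ratio = utf32_size // utf8_size
print(utf32_size, utf8_size, ratio)
```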

Problems with Fixed-Width Strings
*********************************

Both existing string DTypes represent fixed-width sequences, allowing storage of
the string data in the array buffer. This avoids adding out-of-band storage to
NumPy; however, it makes for an awkward user interface. In particular, the
maximum string size must be inferred by NumPy or estimated by the user before
loading the data into a NumPy array or selecting an output DType for string
operations. In the worst case, this requires an expensive pass over the full
dataset to calculate the maximum length of an array element. It also wastes
memory when array elements have varying lengths. Pathological cases where an
array stores many short strings and a few very long strings are particularly bad
for wasting memory.
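A minimal sketch of the pathological case, assuming only NumPy and using
made-up data: 999 one-character strings plus a single 1000-character string. It
also shows a related hazard of fixed widths, namely that assigning a longer
string into an existing array silently truncates it.

```python
import numpy as np

data = ["a"] * 999 + ["b" * 1000]

# NumPy infers the width from the longest element, so every element
# reserves room for 1000 UCS-4 code points (4 bytes each).
arr = np.array(data)
bytes_per_element = arr.dtype.itemsize
total_bytes = arr.nbytes

# Bytes actually needed for the text itself if stored as UTF-8.
payload_bytes = sum(len(s.encode("utf-8")) for s in data)

# Assigning a string wider than the DType silently truncates it.
short = np.array(["ab"])
short[0] = "abcdef"
truncated = str(short[0])

print(total_bytes, payload_bytes, truncated)
```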

Downstream usage of string data in NumPy arrays has proven out the need for a
variable-width string data type. In practice, most downstream users employ
``object`` arrays for this purpose. In particular, ``pandas`` has explicitly
deprecated support for NumPy fixed-width strings, coerces NumPy fixed-width
string arrays to ``object`` arrays, and in the future may switch to only
supporting string data via ``PyArrow``, which has native support for UTF-8
encoded variable-width string arrays [1]_. This is unfortunate, since ``object``
arrays have no type guarantees, necessitating expensive sanitization passes, and
operations using object arrays cannot release the GIL.
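To illustrate the missing type guarantee, here is a small sketch (assuming only
NumPy; the helper name ``is_all_str`` is hypothetical) of the kind of full pass
a library has to make before it can treat an ``object`` array as string data:

```python
import numpy as np


def is_all_str(arr):
    """Return True if every element of an object array is a Python str."""
    return all(isinstance(item, str) for item in arr.ravel())


arr = np.array(["spam", "eggs", "ham"], dtype=object)
clean_before = is_all_str(arr)

# Nothing stops a non-string from being stored later on.
arr[1] = 3.14
clean_after = is_all_str(arr)
```

The check touches every element and runs in Python, which is the cost a DType
with a built-in string guarantee is meant to remove.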

Previous Discussions
--------------------

The project last discussed this topic in depth in 2017, when Julian Taylor
proposed a fixed-width text data type parameterized by an encoding [2]_. This
started a wide-ranging discussion about pain points for working with string data
in NumPy and possible ways forward.

In the end, the discussion identified two use-cases that the current support for
strings does a poor job of handling:

* Loading or memory-mapping scientific datasets with unknown encoding,

* Working with string data in a manner that allows transparent conversion
  between NumPy arrays and Python strings, including support for missing
  strings.

As a result of this discussion, improving support for string data was added to
the NumPy project roadmap [3]_, with an explicit call-out to add a DType better
suited to memory-mapping bytes with any or no encoding, and a variable-width
string DType that supports missing data to replace usages of object string
arrays.

Proposed work
-------------

This NEP proposes adding ``StringDType``, a DType that stores variable-width
heap-allocated strings in NumPy arrays, to replace downstream usages of the
``object`` DType for string data. This work will heavily leverage recent
improvements in NumPy to improve support for user-defined DTypes, so we will
also necessarily be working on the data type internals in NumPy. In particular,
we propose to:

* Add a new variable-length string DType to NumPy, targeting NumPy 2.0.

* Work out issues related to adding a DType implemented using the experimental
  DType API to NumPy itself.

* Add support for a user-provided missing data sentinel.

* Clean up ``np.char``, moving the ufunc-like functions to a new namespace for
  functions and types related to string support.

* Update the ``npy`` and ``npz`` file formats to allow storage of
  arbitrary-length sidecar data.

The following is out of scope for this work:

* Changing DType inference for string data.

* Adding a DType for memory-mapping text in unknown encodings or a DType that
  attempts to fix issues with the ``bytes_`` DType.

* Fully agreeing on the semantics of a missing data sentinel or adding a
  missing data sentinel to NumPy itself.

* Implementing fast ufuncs or SIMD optimizations for string operations.

While we're explicitly ruling out implementing these items as part of this work,
adding a new string DType helps set up future work that does implement some of
these items.

If implemented, this NEP will make it easier to add a new fixed-width text DType
in the future by moving string operations into a long-term supported
namespace. We are also proposing a memory layout that should be amenable to
writing fast ufuncs and SIMD optimization in some cases, increasing the payoff
for writing string operations as SIMD-optimized ufuncs in the future.

While we are not proposing adding a missing data sentinel to NumPy, we are
proposing adding support for an optional, user-provided missing data sentinel,
so this does move NumPy a little closer to officially supporting missing
data. We are attempting to avoid resolving the disagreement described in
:ref:`NEP 26<NEP26>` and this proposal does not require or preclude adding a
missing data sentinel or bitflag-based missing data support in the future.

Usage and Impact
----------------

The DType is intended as a drop-in replacement for object string arrays. This
means that we intend to support as many downstream usages of object string
arrays as possible, including all supported NumPy functionality. Pandas is the
obvious first user, and substantial work has already occurred to add support in
a fork of Pandas. ``scikit-learn`` also uses object string arrays and will be
able to migrate to a DType that guarantees the arrays contain only
strings. Both h5py [4]_ and PyTables [5]_ will be able to add first-class
support for variable-width UTF-8 encoded string datasets in HDF5. String data
are heavily used in machine-learning workflows and downstream machine learning
libraries will be able to leverage this new DType.

Users who wish to load string data into NumPy and leverage NumPy features like
advanced (fancy) indexing will have a natural choice that offers substantial
memory savings over fixed-width unicode strings and better validation guarantees
and overall integration with NumPy than object string arrays. Moving to a
first-class string DType also removes the need to acquire the GIL during string
operations, unlocking future optimizations that are impossible with object
string arrays.
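As a rough illustration of the intended usage: in NumPy 2.0 and later, the
DType that grew out of this proposal is available as ``np.dtypes.StringDType``.
That location is a fact about later releases, not something this draft
specifies, and the prototype discussed here lives in a separate ``stringdtype``
package. The sketch below therefore checks for the class at runtime and returns
``None`` on older NumPy versions; the helper name ``make_string_array`` is
hypothetical.

```python
import numpy as np


def make_string_array(values):
    """Build a variable-width string array, or return None if unsupported."""
    dtypes_ns = getattr(np, "dtypes", None)
    string_dtype = getattr(dtypes_ns, "StringDType", None)
    if string_dtype is None:
        return None  # NumPy older than 2.0
    return np.array(values, dtype=string_dtype())


arr = make_string_array(["a", "a considerably longer string", "c"])
if arr is not None:
    # Advanced (fancy) indexing works as it does for other DTypes.
    picked = arr[[2, 0]]
    print(picked.tolist())
```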

Performance
***********

Here we briefly describe preliminary performance measurements of the prototype
version of ``StringDType`` we have implemented outside of NumPy using the
experimental DType API. All benchmarks in this section were performed on a Dell
XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy,
Pandas, and the ``StringDType`` prototype were all compiled with meson release
builds.

Currently, the ``StringDType`` prototype has comparable performance with object
arrays and fixed-width string arrays. One exception is array creation from
Python strings, where performance is somewhat slower than for object arrays and
comparable to fixed-width unicode arrays::

    In [1]: from stringdtype import StringDType

    In [2]: import numpy as np

    In [3]: data = [str(i) * 10 for i in range(100_000)]

    In [4]: %timeit arr_object = np.array(data, dtype=object)
    3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
    12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
    11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this example, object DTypes are substantially faster because the objects in
the ``data`` list can be directly interned in the array, while ``StrDType`` and
``StringDType`` need to copy the string data and ``StringDType`` needs to
convert the data to UTF-8 and perform additional heap allocations outside the
array buffer. In the future, if Python moves to a UTF-8 internal representation
for strings, the string loading performance of ``StringDType`` should improve.
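For readers who want to reproduce the comparison outside IPython, here is a
sketch using the standard-library ``timeit`` module. It covers only the
``object`` and fixed-width ``str`` cases, since the ``StringDType`` prototype is
a separate install, and the helper name ``best_creation_time`` is hypothetical.
Note that an assignment inside ``%timeit`` is not kept in the session namespace,
so arrays needed later have to be created in a separate statement, as done
here. Absolute timings will differ by machine.

```python
import timeit

import numpy as np

data = [str(i) * 10 for i in range(100_000)]


def best_creation_time(dtype, repeat=3, number=5):
    """Best per-call time, in seconds, to build an array from ``data``."""
    timer = timeit.Timer(lambda: np.array(data, dtype=dtype))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Create the arrays outside the timed statement so they can be reused.
arr_object = np.array(data, dtype=object)
arr_strdtype = np.array(data, dtype=str)

object_seconds = best_creation_time(object)
str_seconds = best_creation_time(str)
print(f"object: {object_seconds * 1e3:.2f} ms, str: {str_seconds * 1e3:.2f} ms")
```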

String operations have similar performance::

    In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
    30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [8]: %timeit np.char.capitalize(arr_stringdtype)
    38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [9]: %timeit np.char.capitalize(arr_strdtype)
    46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The poor performance here is a reflection of the slow iterator-based
implementation of operations in ``np.char``. If we were to rewrite these
operations as ufuncs, we could unlock substantial performance
improvements. Consider the ``add`` ufunc, which we have implemented for the
``StringDType`` prototype::

    In [10]: %timeit arr_object + arr_object
    10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [11]: %timeit arr_stringdtype + arr_stringdtype
    5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
    65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As described below, we have already updated a fork of Pandas to use a prototype
version of ``StringDType``. This demonstrates the performance improvements
available when data are already loaded into a NumPy array and are passed to a
third-party library. Currently Pandas attempts to coerce all ``str`` data to
``object`` DType by default, and has to check and sanitize existing ``object``
arrays that are passed in. This requires a copy of or pass over the data, which
first-class support for variable-width strings in both NumPy and Pandas makes
unnecessary::

    In [13]: import pandas as pd

    In [14]: %timeit pd.Series(arr_stringdtype)
    20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [15]: %timeit pd.Series(arr_object)
    1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We have also implemented a Pandas extension DType that uses ``StringDType``
under the hood, which is also substantially faster for creating Pandas data
structures than the existing Pandas string DType that uses ``object`` arrays::

    In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
    54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
    1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Backward compatibility
----------------------

We are not proposing a change to DType inference for Python strings and do not
expect to see any impacts on existing usages of NumPy, besides warnings or
errors related to new deprecations or expiring deprecations in ``np.char``.
