[Numpy-discussion] proposal: smaller representation of string arrays

Charles R Harris charlesr.harris at gmail.com
Thu Apr 20 15:24:35 EDT 2017

On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern <robert.kern at gmail.com> wrote:

> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
> jtaylor.debian at googlemail.com> wrote:
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.
>
> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, the FITS format more or less follows numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but supports UTF-8 only as
> variable-length strings.
>
> We should look at some of the newer formats and APIs, like Parquet and
> Arrow, and also consider the cross-language APIs with Julia and R.
>
> If I had to jump ahead and propose new dtypes, I might suggest this:
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.
> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).
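[Editorial sketch, not part of the original message.] The behaviors discussed
above can be illustrated with today's NumPy dtypes. The snippet below shows the
silent truncation and trailing-NULL stripping of the fixed-width `S` dtype, the
byte-preserving `void` dtype, and a pair of utility functions of the kind the
last bullet describes. The names `encode_fixed` and `decode_fixed` are
hypothetical and exist only for illustration; they assume a non-empty 1-D input
and are not an existing or proposed NumPy API.

```python
import numpy as np

# Fixed-width bytes dtype: assigning a longer value truncates it silently.
a = np.array([b"abc", b"de"], dtype="S3")
a[1] = b"too long"
truncated = a[1]            # b"too"

# Trailing NUL bytes are treated as padding and stripped on element access,
# so a value that really ends in NUL cannot be round-tripped.
b = np.array([b"ab\x00"], dtype="S4")
stripped = b[0]             # b"ab"

# The void dtype keeps every byte, but does not behave like `bytes`.
v = np.frombuffer(b"ab\x00\x00", dtype="V4")
raw = v[0].tobytes()        # b"ab\x00\x00"

# An object array has no length restriction at all.
o = np.array(["short", "x"], dtype=object)
o[1] = "y" * 50


def encode_fixed(strings, encoding="utf-8"):
    """Hypothetical helper: 1-D sequence of str -> fixed-width bytes array."""
    encoded = [s.encode(encoding) for s in strings]
    width = max(1, max(len(e) for e in encoded))
    return np.array(encoded, dtype=f"S{width}")


def decode_fixed(arr, encoding="utf-8"):
    """Hypothetical helper: 1-D fixed-width bytes array -> object array of str."""
    return np.array([item.decode(encoding) for item in arr.tolist()], dtype=object)


words = np.array(["naïve", "Ω", "plain"], dtype=object)
packed = encode_fixed(words)        # item size is the longest UTF-8 encoding
unpacked = decode_fixed(packed)
```

Because the helpers go through the `S` dtype, they inherit its trailing-NULL
stripping; a non-NULL-terminated bytes dtype as proposed above would not.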
A little history: IIRC, storing null-terminated strings in fixed byte
lengths was done in Fortran; strings were usually stored in

If memory mapping of arbitrary types is not important, I'd settle for ASCII
or Latin-1, fixed-byte-length UTF-8, and arrays of a fixed Python object
type. Using one-byte encodings and UTF-8 avoids needing to deal with

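[Editorial sketch, not part of the original message.] To make the size
argument concrete, the snippet below compares the per-item storage of the
current unicode dtype with Latin-1 and UTF-8 bytes stored in fixed-width `S`
arrays. The sample strings are made up for illustration, and building the `S`
arrays from a list comprehension is just one simple way to do the conversion.

```python
import numpy as np

names = ["café", "naïve", "abc"]

# Current unicode dtype: 4 bytes per code point, sized for the longest string.
u = np.array(names)

# One-byte encoding: 1 byte per character for text that fits in Latin-1.
latin1 = np.array([n.encode("latin-1") for n in names])

# UTF-8 in a fixed byte length: the width is the longest *encoded* length,
# so the number of characters that fit depends on the content.
utf8 = np.array([n.encode("utf-8") for n in names])

sizes = {
    "unicode": u.dtype.itemsize,
    "latin-1": latin1.dtype.itemsize,
    "utf-8": utf8.dtype.itemsize,
}
```

For this input the unicode array needs 20 bytes per item, the Latin-1 array 5,
and the UTF-8 array 6, since "naïve" is five characters but six UTF-8 bytes.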
