[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 20 14:53:53 EDT 2017

On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the
current sub-optimal situation because we never really laid out the use
cases. We just felt like we needed bytestring and unicode dtypes, more out
of completionism than anything, and we made a bunch of assumptions just to
get each one done. I think there may be broad agreement that many of those
assumptions are "wrong", but it would be good to check them against
concretely stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code,
I'm going to use dtype=object a la pandas. It has almost no arbitrary
constraints, and I can rely on Python's unicode facilities freely. There
may be some cases where it's a little less memory-efficient (e.g.
representing a column of enumerated single-character values like 'M'/'F'),
but that's never prevented me from doing anything (compare to the
uniform-length restrictions, which *have* prevented me from doing things).
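
To make the trade-off concrete, here is a minimal sketch contrasting a
fixed-width unicode array with a `dtype=object` array. The variable names are
illustrative only, and the behavior shown is that of the existing `U` and
`object` dtypes, not of any proposed dtype.

```python
import numpy as np

# Fixed-width unicode: every element occupies the same number of bytes
# (4 bytes per character, since the 'U' dtype stores UCS-4).
fixed = np.array(["M", "F", "M"], dtype="U1")
print(fixed.itemsize)   # 4

# Assigning a longer string silently truncates it to the fixed width.
fixed[0] = "Male"
print(fixed[0])         # M

# An object array holds references to ordinary Python str objects,
# so there is no length restriction and nothing is truncated.
flexible = np.array(["M", "F", "M"], dtype=object)
flexible[0] = "Male"
print(flexible[0])      # Male
```

The object array costs a pointer per element plus the Python string objects
themselves, which is the memory overhead mentioned above for columns of
single-character values.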

So what's left? Being able to memory-map to files that have string data
conveniently laid out according to numpy assumptions (e.g. FITS). Being
able to work with C/C++/Fortran APIs that have arrays of strings laid out
according to numpy assumptions (e.g. HDF5). I think it would behoove us to
canvass the needs of these formats and APIs before making any more
assumptions.

For example, to my understanding, FITS more or less follows numpy's
assumptions for its string columns (i.e. uniform-length). But it enforces
7-bit-clean ASCII and pads with trailing NULLs; I believe this was the
sole motivating use case for the trailing-NULL behavior of np.string.
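
As an illustration of that trailing-NULL behavior, the following sketch reads
a NULL-padded buffer through the current `S` dtype. The buffer contents are
made up for the example and do not come from a real FITS file.

```python
import numpy as np

# Two 4-byte fields: the first is NULL-padded, the second has an
# embedded NULL followed by a non-NULL byte.
raw = b"ab\x00\x00" + b"cd\x00e"
arr = np.frombuffer(raw, dtype="S4")

print(arr[0])                 # b'ab'       (trailing NULLs stripped on access)
print(arr[1])                 # b'cd\x00e'  (embedded NULL is kept)
print(arr.tobytes() == raw)   # True        (the underlying buffer is unchanged)

# The flip side: a value that genuinely ends in a NULL byte cannot be
# recovered from an element, because the padding is indistinguishable.
lossy = np.array([b"ab\x00"], dtype="S3")
print(lossy[0])               # b'ab'
```

This is convenient for NULL-padded text fields, but it is also why the `S`
dtype is a poor fit for arbitrary binary data.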

I don't know of a format off-hand that works with numpy uniform-length
strings and Unicode as well. HDF5 (to my recollection) supports arrays of
NULL-terminated, uniform-length ASCII like FITS, but for UTF-8 it only
supports variable-length strings.

We should look at some of the newer formats and APIs, like Parquet and
Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication
formats rather than the preferred in-memory working format, similar to how
we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype,
but perhaps add a new canonical alias, document it as being for those
specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be
similar to the current `void` dtype, but work more transparently with the
`bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
like `float64` does with `float`. This would not be NULL-terminated. No
encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str`
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
This maintains all of the flexibility of using a `dtype=object` array while
allowing code to specialize for working with strings without all kinds of
checking on every item. But most importantly, we can serialize such an
array to bytes without having to use pickle. Utility functions could be
written for en-/decoding to/from the uniform-length bytestring arrays
handling different encodings and things like NULL-termination (also working
with the legacy dtypes and handling structured arrays easily, etc.).
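
The utility functions mentioned in the last bullet could look roughly like the
sketch below. The names `encode_fixed` and `decode_fixed` are hypothetical,
not existing numpy functions, and a plain `dtype=object` array stands in for
the proposed string-only dtype.

```python
import numpy as np

def encode_fixed(strings, encoding="utf-8"):
    """Hypothetical helper: pack str items into a uniform-length 'S' array."""
    encoded = [s.encode(encoding) for s in strings]
    # The field width is the longest encoded item, with a minimum of 1 byte.
    width = max([len(b) for b in encoded] + [1])
    return np.array(encoded, dtype="S%d" % width)

def decode_fixed(arr, encoding="utf-8"):
    """Hypothetical helper: unpack an 'S' array into an object array of str."""
    out = np.empty(arr.shape, dtype=object)
    for idx, item in np.ndenumerate(arr):
        # Trailing NULL padding is already stripped when an element is read.
        out[idx] = item.decode(encoding)
    return out

names = np.array(["Zoë", "Bob"], dtype=object)
packed = encode_fixed(names)     # "Zoë" is 4 bytes in UTF-8, so dtype is S4
restored = decode_fixed(packed)
print(packed.dtype, restored.tolist())
```

Keeping the encoding as an explicit argument leaves the working array free of
any encoding assumption; only the packed, uniform-length form commits to one.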

--
Robert Kern