[Numpy-discussion] proposal: smaller representation of string arrays

Anne Archibald peridot.faceted at gmail.com
Thu Apr 20 15:17:39 EDT 2017


On Thu, Apr 20, 2017 at 8:55 PM Robert Kern <robert.kern at gmail.com> wrote:

> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
> jtaylor.debian at googlemail.com> wrote:
>
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.
>

+1


> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
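
For concreteness, the dtype=object approach is just (roughly):

import numpy as np

# Ordinary Python str objects stored by reference: no fixed itemsize,
# no truncation, and all of Python's unicode machinery applies.
a = np.array([u'M', u'F', u'H\u00e9l\u00e8ne'], dtype=object)
upper = np.array([s.upper() for s in a], dtype=object)
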
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, FITS files more or less follow numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
>

Actually, if I understood the spec correctly, FITS header lines are 80 bytes
long and contain ASCII with no NULLs; strings are quoted and trailing spaces
are stripped.
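
The trailing-NULL behaviour Robert describes is easy to see from the
interpreter, for what it's worth:

>>> import numpy as np
>>> a = np.array([b'NGC', b'M31'], dtype='S5')
>>> a.tobytes()          # NUL-padded to the full 5 bytes in memory
b'NGC\x00\x00M31\x00\x00'
>>> a[0]                 # trailing NULs stripped on the way back out
b'NGC'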

[...]

> If I had to jump ahead and propose new dtypes, I might suggest this:
>
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
>
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
>
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.
>

How would this differ from a numpy array of bytes with one more dimension?
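
Roughly what I have in mind:

>>> import numpy as np
>>> s = np.array([b'spam', b'eggs'], dtype='S4')
>>> b = s.view(np.uint8).reshape(s.shape + (4,))  # same memory, one extra axis
>>> b.shape
(2, 4)
>>> b[0].tobytes()
b'spam'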


> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).
>
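
Those en-/decoding utilities could be pretty small; a rough, untested
sketch (the names are just placeholders, not a proposed API):

import numpy as np

def encode_strings(strings, encoding='utf-8'):
    # object array (or list) of str -> uniform-width bytes ('S') array
    raw = [s.encode(encoding) for s in strings]
    return np.array(raw, dtype='S%d' % max(len(r) for r in raw))

def decode_strings(bytes_arr, encoding='utf-8'):
    # uniform-width bytes array -> object array of str
    out = np.empty(bytes_arr.shape, dtype=object)
    for idx, item in np.ndenumerate(bytes_arr):
        out[idx] = item.decode(encoding)
    return out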

I think there may also be a niche for fixed-byte-size, null-terminated
strings with a uniform encoding, which handle encoding and decoding
automatically. The encoding would naturally be attached to the dtype, and
they would handle too-long strings by either truncating to a valid encoding
or simply raising an exception. As with the current fixed-length strings,
they'd mostly be for communication with other code, so whether they're
needed depends on whether such code exists at all. Databases, perhaps?
Custom hunks of C that don't want to deal with variable-length packing of
data? Actually this last seems plausible: if I want to pass a great wodge of
data, including Unicode strings, to a C program, writing out a numpy array
may well be the easiest route.
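
To make that concrete, a rough, untested sketch of the C-program case, done
by hand with today's dtypes (the names, field layout and 16-byte width are
just for illustration):

import numpy as np

def pack_records(names, values, encoding='utf-8', size=16):
    # Fixed-width, NUL-padded, encoded strings alongside numeric data:
    # something a C program can read back as a plain array of structs
    # (matching the struct packing on the C side is its own problem).
    raw = [n.encode(encoding) for n in names]
    if max(len(r) for r in raw) > size:
        # the alternative would be truncating to a valid encoding
        raise ValueError('encoded string does not fit in %d bytes' % size)
    rec = np.zeros(len(raw), dtype=[('name', 'S%d' % size), ('value', '<f8')])
    rec['name'] = raw
    rec['value'] = values
    return rec

# pack_records([u'H\u00e9l\u00e8ne', u'Ana'], [1.5, 2.5]).tofile('records.bin')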

Anne