On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things).
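
To make the uniform-length restriction concrete, here is a minimal sketch comparing a fixed-width unicode array with a `dtype=object` array. The variable names are purely illustrative, and the behavior shown is what I understand current NumPy to do.

```python
import numpy as np

# Fixed-width unicode dtype: every element holds at most 3 code points.
fixed = np.array(["abc", "de"], dtype="U3")
fixed[1] = "a much longer string"   # silently truncated to fit the width
truncated = str(fixed[1])

# Object dtype: elements are ordinary Python str objects of any length.
flexible = np.array(["abc", "de"], dtype=object)
flexible[1] = "a much longer string"
kept = flexible[1]

# Python's own str methods are available directly on each element.
shouted = flexible[1].upper()
```

The assignment into the fixed-width array raises no error; the extra characters are simply dropped, which is the kind of constraint the object array avoids.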

So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions.

For example, to my understanding, FITS files more or less follow numpy assumptions for their string columns (i.e. uniform-length). But they enforce 7-bit-clean ASCII and pad with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.
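
For readers who have not run into the trailing-NULL behavior, a small sketch follows. It uses the `S` dtype code (the fixed-width bytestring dtype) and assumes the behavior of current NumPy releases; the variable names are illustrative only.

```python
import numpy as np

# Two 4-byte fields; the first is NULL-padded, as a FITS-style column would be.
arr = np.array([b"ab\x00\x00", b"abcd"], dtype="S4")

# Item access strips the trailing NULL padding...
first = arr[0]

# ...but the underlying buffer still holds the full fixed-width fields.
raw = arr.tobytes()

# The flip side: a value that genuinely ends in a NULL byte cannot round-trip.
lossy = np.array([b"a\x00"], dtype="S2")[0]
```

This is convenient for padded text columns, but it means the dtype cannot faithfully hold arbitrary binary data.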

I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings.

We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).
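
As a rough sketch of what the en-/decoding utilities in the last bullet might look like, under the assumption that they are plain functions converting between `dtype=object` arrays of `str` and fixed-width bytestring arrays: `encode_fixed` and `decode_fixed` are hypothetical names, not existing NumPy functions.

```python
import numpy as np

def encode_fixed(strings, encoding="utf-8"):
    """Hypothetical helper: sequence of str -> fixed-width bytes ('S') array."""
    encoded = [s.encode(encoding) for s in strings]
    # The field width is the longest encoded value, in bytes (at least 1).
    width = max((len(b) for b in encoded), default=1)
    return np.array(encoded, dtype=f"S{max(width, 1)}")

def decode_fixed(byte_array, encoding="utf-8"):
    """Hypothetical helper: fixed-width bytes array -> object array of str."""
    out = np.empty(byte_array.shape, dtype=object)
    for idx, item in np.ndenumerate(byte_array):
        # Item access has already stripped the trailing NULL padding.
        out[idx] = item.decode(encoding)
    return out

# Round trip through a uniform-length, NULL-padded UTF-8 representation.
packed = encode_fixed(["na\u00efve", "caf\u00e9", "x"])
restored = decode_fixed(packed)
```

Note that the width is measured in encoded bytes, not characters, so the same strings need different field widths under different encodings.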

A little history: IIRC, storing null-terminated strings in fixed byte lengths was done in Fortran, where strings were usually stored in integers or integer arrays.

If memory mapping of arbitrary types is not important, I'd settle for fixed-byte-length ascii, latin-1, or utf-8, plus arrays of a fixed python object type. Using one-byte encodings and utf-8 avoids needing to deal with endianness.
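
A small illustration of the endianness point, assuming current NumPy behavior: the existing unicode dtype stores 4-byte code points, so its buffer depends on byte order, while a utf-8 bytestring does not. The variable names are illustrative only.

```python
import numpy as np

# The 'U' dtype stores UCS-4 code points, so byte order is part of the dtype.
little = np.array(["A"], dtype="<U1").tobytes()
big = np.array(["A"], dtype=">U1").tobytes()

# utf-8 is a byte-oriented encoding; there is only one possible layout.
utf8 = np.array(["A".encode("utf-8")], dtype="S1").tobytes()
```

The two `U1` buffers hold the same character in different byte layouts, which is exactly the bookkeeping that a one-byte encoding or utf-8 sidesteps.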