On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald <peridot.faceted@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 8:55 PM Robert Kern <robert.kern@gmail.com> wrote:
>
>> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string.
>
> Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped.

Never mind, then. :-)

>> If I had to jump ahead and propose new dtypes, I might suggest this:
>>
>> * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs.
>>
>> * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name.
>>
>> * Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied.
>
> How would this differ from a numpy array of bytes with one more dimension?

The scalar in the implementation being the scalar in the use case, immutability of the scalar, directly working with b'' strings in and out (and thus work with the Python codecs easily).

>> * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.).
>
> I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest.

HDF5 seems to support this, but only for ASCII and UTF8, not a large list of encodings.

--
Robert Kern
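
As a rough illustration of the behavior discussed in this thread, the sketch below shows how the existing fixed-width bytes dtype (`S5` here) pads with NULs and strips them on access, how the same buffer looks as a `uint8` array with one more dimension, and how decoding goes through the Python codecs. The variable names and sample values are made up for the example, and it only demonstrates current NumPy behavior, not any of the proposed dtypes.

```python
import numpy as np

# Fixed-width bytes dtype: every element occupies exactly 5 bytes.
a = np.array([b"ab", b"abcde"], dtype="S5")
raw = a.tobytes()            # short entries are padded with NUL bytes
first = a[0]                 # scalar access strips the trailing NULs

# A trailing NUL in the input cannot be told apart from padding.
b = np.array([b"ab\x00"], dtype="S5")
same_as_unpadded = bool(b[0] == b"ab")

# The same memory viewed as raw bytes with one extra dimension.
# Rows are mutable uint8 arrays, not immutable bytes scalars.
u = a.view(np.uint8).reshape(len(a), -1)

# Decoding to a unicode array via the Python codecs.
s = np.char.decode(a, "ascii")

print(raw, first, same_as_unpadded, u.shape, s.tolist())
```

The `uint8` view holds the same data, but each row is a mutable array that has to be converted before it can be handed to `bytes` methods or a codec, which is the practical difference described in the reply above.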