On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:

On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern@gmail.com> wrote:
* HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. Compression is supported for fixed-length string arrays but not variable-length string arrays.

* FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecified encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small.

* pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`.

* There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform.

* In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars.
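To make the last point concrete, here is a minimal sketch of the `bytes` versus `str` mismatch on Python 3. The array contents and variable names are made up for illustration, and decoding as ASCII is an assumption about the data, not something any of the formats above guarantee.

```python
import numpy as np

# An 'S' (bytes) array; its elements come back as bytes on Python 3.
names = np.array([b"alpha", b"beta"], dtype="S5")
first = names.tolist()[0]          # plain Python bytes object

# bytes and str never compare equal on Python 3, even for pure-ASCII content.
naive_match = (first == "alpha")

# Decoding explicitly (ASCII is an assumption here) restores the comparison.
decoded = np.char.decode(names, "ascii")
decoded_match = (decoded.tolist()[0] == "alpha")

print(naive_match, decoded_match)
```

The explicit decode step is the "Unicode sandwich" in miniature: it only works when the encoding is actually known.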

It seems the greatest challenge is interacting with binary data from other programs and libraries. If we were living entirely in our own data world, Unicode strings in object arrays would generally be pretty satisfactory. So let's try to get what is needed to read and write other people's formats.

I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map directly to numpy arrays; we can store them however we want, as conversion is necessary anyway.

Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks (ANSI X9.23, ISO 10126, PKCS7, etc.), but they are probably too specialized to be needed. We should make sure we can support all the ways that actually occur.
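As a rough sketch of what "fixed storage size, padded UTF-8" could look like with today's `S` dtype: the helpers `encode_fixed_utf8` and `decode_fixed_utf8` below are hypothetical names, not numpy API, and they assume 1-D arrays and that the pad byte never appears as meaningful trailing data. The `pad` argument stands in for the "how is the length specified?" question (NUL-padded versus space-padded).

```python
import numpy as np

def encode_fixed_utf8(strings, nbytes, pad=b"\x00"):
    """Pack str values into an 'S<nbytes>' array of padded UTF-8 (hypothetical helper)."""
    fields = []
    for s in strings:
        raw = s.encode("utf-8")
        if len(raw) > nbytes:
            # Refuse rather than truncate: cutting bytes could split a character.
            raise ValueError("%r needs %d bytes, field holds %d" % (s, len(raw), nbytes))
        fields.append(raw.ljust(nbytes, pad))
    return np.array(fields, dtype="S%d" % nbytes)

def decode_fixed_utf8(arr, pad=b"\x00"):
    """Unpack a 1-D 'S' array: strip trailing pad bytes, then decode as UTF-8."""
    n = arr.dtype.itemsize
    buf = arr.tobytes()
    return [buf[i:i + n].rstrip(pad).decode("utf-8") for i in range(0, len(buf), n)]

words = ["café", "naïve", "ab"]

nul_padded = encode_fixed_utf8(words, 8)               # zero-padded fields
space_padded = encode_fixed_utf8(words, 8, pad=b" ")   # space-padded fields

print(decode_fixed_utf8(nul_padded))
print(decode_fixed_utf8(space_padded, pad=b" "))
```

Note that the field width is counted in bytes, not characters, so how many characters fit depends on the content. Space padding also cannot distinguish a padded value from one that genuinely ends in spaces.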

Agree with the UTF-8 fixed byte length strings, although I would tend towards null-terminated.

For byte strings, it looks like we need a parameterized type. This is for two uses: display and conversion to (Python) unicode. One could handle the display and conversion using view and astype methods. For instance, we already have

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
array(['1', '2', '3'],
      dtype='|S1')

In [3]: a.view('S1').astype('U')
array([u'1', u'2', u'3'],
      dtype='<U1')
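For comparison, here is roughly the same session written for Python 3 with explicit imports, plus one way the encoding parameter could be supplied today via `np.char.decode`. This is only a sketch: the choice of Latin-1 for the high-bit byte is an assumption for illustration, and the variable names are mine. As far as I know, `astype('U')` on an `S` array decodes as ASCII, which is why the encoding has to come from somewhere else.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.uint8) + 0x30   # ASCII codes for '1', '2', '3'

as_bytes = a.view("S1")            # reinterpret each byte as a 1-byte string
as_text = as_bytes.astype("U")     # bytes -> unicode for ASCII-only content

# A high-bit byte needs an explicit encoding; Latin-1 is assumed here.
high_bit = np.array([0xE9], dtype=np.uint8).view("S1")
as_latin1 = np.char.decode(high_bit, "latin-1")

print(as_bytes, as_text, as_latin1)
```

A parameterized byte-string dtype would in effect carry that `"latin-1"` argument along with the array instead of requiring it at every conversion.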