On Apr 21, 2017 2:34 PM, "Stephan Hoyer" <shoyer@gmail.com> wrote:
I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

You may already know this, but probably not everyone reading does: the reason why latin1 often gets special attention in discussions of Unicode encoding is that latin1 is effectively "ucs1". It's the unique one-byte text encoding where byte N represents code point U+N, for every N from 0 to 255.
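
For anyone who wants to see that property concretely, here is a minimal sketch in plain Python (nothing numpy-specific; the variable names are just for illustration):

```python
# Every byte value 0..255 decodes under latin1 to the code point
# with the same number, and encodes back to the same byte.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin1")
codepoints = [ord(ch) for ch in decoded]
roundtrip = decoded.encode("latin1")

# UTF-8 agrees with latin1 only on the ASCII range; above that,
# a single code point takes more than one byte.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, U+00E9
latin1_bytes = text.encode("latin1")
utf8_bytes = text.encode("utf-8")

print(codepoints == list(range(256)), roundtrip == all_bytes)
print(len(latin1_bytes), len(utf8_bytes))
```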

I can't think of any reason why this property is particularly important for numpy's usage, because we always have a conversion step anyway to get data in and out of an array. The potential arguments for latin1 that I can think of are:
- if we have to implement our own en/decoding code for some reason then it's the most trivial encoding
- if other formats standardize on latin1-with-nul-padding and we want in-memory/mmap compatibility
- if we really want a fixed width encoding for some reason but don't care which one, then it's in some sense the most obvious choice

I can't think of many reasons why having a fixed-width encoding is particularly important though... For our current style of string storage, even calculating the length of a string is O(n), and AFAICT the only way to actually take advantage of the theoretical O(1) character indexing is to make a uint8 view. I guess it would be useful if we had a string slicing ufunc... But why would we?
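
To make the uint8-view point concrete, here is a rough sketch assuming the current fixed-width `S` dtype. The array contents and variable names are made up for illustration, and `np.char.str_len` is used only to show that lengths have to be recovered by looking past the padding:

```python
import numpy as np

# Three byte strings stored in fixed-width, nul-padded 5-byte fields.
arr = np.array([b"abc", b"de", b"fghij"], dtype="S5")

# Reinterpret the same memory as raw bytes: one row per string,
# one column per byte position within the field.
raw = arr.view(np.uint8).reshape(arr.shape[0], arr.dtype.itemsize)

# O(1) access to "character 2 of every string"; padding shows up as 0.
third = raw[:, 2]

# The logical length is not stored anywhere; it has to be computed
# from the field contents.
lengths = np.char.str_len(arr)

print(raw.shape, third.tolist(), lengths.tolist())
```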

That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls, which numpy treats as padding and strips on the way out). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically.
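
A small sketch of the trailing-nul problem, assuming the current `S` and `U` dtypes (the example values are arbitrary):

```python
import numpy as np

original_bytes = b"ab\x00"
original_text = "ab\x00"

bytes_arr = np.array([original_bytes], dtype="S3")
text_arr = np.array([original_text], dtype="U3")

# The nul is still physically present in the buffer...
stored = bytes_arr.tobytes()

# ...but on the way out it is indistinguishable from padding and is dropped.
recovered_bytes = bytes_arr[0]
recovered_text = text_arr[0]

print(stored, recovered_bytes, repr(str(recovered_text)))
```

So the round trip loses information: the value that comes back out is shorter than the one that went in.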

From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't.