On Tue, Apr 25, 2017 at 6:36 PM Chris Barker <chris.barker@noaa.gov> wrote:

This is essentially my rant about use-case (2):

A compact dtype for mostly-ascii text:

I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be compatible with anything in particular? 

If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already Python strings in memory. Could you use an object array? Or do you have an enormous number of them, so that you need a more compact, fixed-stride memory layout?
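To illustrate the trade-off (a sketch, not anything from the thread): an object array stores pointers to ordinary Python objects, while a fixed-width bytes dtype packs the raw bytes into one contiguous, fixed-stride buffer.

```python
import numpy as np

# A few mostly-ASCII byte strings of varying length.
data = [b"spam", b"eggs!", b"a"]

# Option 1: an object array just holds references to Python objects --
# flexible, but each element is a separate heap allocation.
obj_arr = np.array(data, dtype=object)

# Option 2: a fixed-width bytes dtype ('S') stores the bytes inline,
# NUL-padded to the widest element, in one contiguous buffer.
fixed_arr = np.array(data, dtype="S5")

print(obj_arr.itemsize)    # pointer-sized (8 on a 64-bit build)
print(fixed_arr.itemsize)  # 5 bytes per element, fixed stride
```

The fixed-width layout is what makes memory-mapping and zero-copy interchange possible, at the cost of padding and a hard length limit.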

Presumably you're getting byte strings (with no NULLs) from somewhere and need to store them in this memory structure in a way that makes them as usable as possible in spite of their unknown encoding. Presumably the thing to do is just to copy them in as-is, and then arrange, via .astype, for Python to decode them when they're accessed. So this is precisely the problem of "how should I decode random byte strings?" that Python has been struggling with. My impression is that the community has established that there's no one solution that makes everyone happy, but that most people can cope with some combination of: picking a one-byte encoding, ascii-with-surrogateescape, zapping bogus characters, and tolerating wrong results. I think all of the standard Python alternatives are needed, in general, for interpreting numpy arrays full of bytes. Clearly your preferred solution is .astype("string[latin-9]"), but just as clearly that's not going to work for everyone.
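For concreteness, here is what those "standard Python alternatives" look like on a single problematic byte string (a plain-Python sketch; the proposed string[latin-9] dtype is hypothetical, so this uses the stdlib codec machinery instead):

```python
# One byte string of unknown encoding: latin-1 "café", invalid as UTF-8.
raw = b"caf\xe9"

# 1. Pick a one-byte encoding: every byte maps to *some* character,
#    so decoding can never fail (though it may be semantically wrong).
latin = raw.decode("latin-1")

# 2. ascii-with-surrogateescape: smuggles the bad bytes through as
#    lone surrogates, so the original bytes round-trip losslessly.
surr = raw.decode("ascii", errors="surrogateescape")
assert surr.encode("ascii", errors="surrogateescape") == raw

# 3. Zap the bogus characters, or mark them with U+FFFD.
zapped = raw.decode("ascii", errors="ignore")     # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD
```

Each strategy trades off differently between never-failing, round-tripping, and producing readable text, which is exactly why no single default satisfies everyone.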

If your question is "what should numpy's default string dtype be?", then maybe the answer is object arrays: anyone who just has a bunch of Python strings to store is unlikely to be surprised by them. Someone with more specific needs can choose a more specific - that is, non-default - string dtype.
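The unsurprising behavior argued for above looks like this in practice (a minimal sketch): the elements stay ordinary str objects, so everything that works on a Python string keeps working.

```python
import numpy as np

# Object arrays hold ordinary Python str objects, so storing
# "a bunch of python strings" behaves exactly as a newcomer expects.
names = np.array(["alpha", "beta", "gamma"], dtype=object)

# Elements come back as real str -- no truncation, no bytes/str surprises.
first = names[0]

# String methods go through the objects themselves:
upper = np.array([s.upper() for s in names], dtype=object)
```

The cost, as noted earlier, is the loss of a compact fixed-stride layout.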