On Tue, Apr 25, 2017 at 6:05 PM Chris Barker <chris.barker@noaa.gov> wrote:
Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear.

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python strings but doesn't care about the underlying representation. 
-> Solvable with object arrays, or Robert's string-specific object arrays; underlying representation is python objects on the heap. Sadly UCS-4, so zillions of strings are going to be a memory problem.

2) User has to deal with fixed-width binary data from an external program/library and wants to see it as python strings. This may be systematically encoded in a known encoding (e.g. HDF5's fixed-storage-length zero-padded UTF-8 strings, spec-observing FITS' zero-padded ASCII) or ASCII-with-exceptions-and-the-user-is-supposed-to-know (e.g. spec-violating FITS files with zero-padded latin-9, koi8-r, cp1251, or whatever). Length may be signaled by null termination, null padding, or space padding.
-> Solvable with a fixed-storage-size encoded-string dtype, as long as it has a parameter for how length is signaled. Python tricks for dealing with wrong or unknown encodings can make bogus data manageable.
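
To make the "parameter for how length is signaled" idea concrete, here is a rough pure-Python sketch of what decoding one field might look like. The function name `decode_field` and the `length_mode` parameter are made up for illustration; this is not an existing numpy API, just the behaviour I have in mind.

```python
def decode_field(raw, encoding="utf-8", length_mode="null-padded", errors="strict"):
    """Turn one fixed-width binary field into a Python str.

    length_mode (hypothetical name) says how the real length is signaled:
      "null-terminated": content ends at the first NUL byte
      "null-padded":     trailing NUL bytes are padding
      "space-padded":    trailing spaces are padding
    """
    if length_mode == "null-terminated":
        raw = raw.split(b"\x00", 1)[0]
    elif length_mode == "null-padded":
        raw = raw.rstrip(b"\x00")
    elif length_mode == "space-padded":
        raw = raw.rstrip(b" ")
    else:
        raise ValueError("unknown length_mode: %r" % (length_mode,))
    return raw.decode(encoding, errors)


# Systematically encoded: zero-padded UTF-8 in an 8-byte field.
utf8_field = "caf\u00e9".encode("utf-8").ljust(8, b"\x00")
good = decode_field(utf8_field, "utf-8", "null-padded")

# Space-padded ASCII in a 12-byte field.
ascii_field = b"NGC 1234".ljust(12, b" ")
name = decode_field(ascii_field, "ascii", "space-padded")

# Bogus data: a latin-1 byte in a field that claims to be ASCII.
# surrogateescape keeps the stray byte as a lone surrogate instead of failing.
bogus_field = b"caf\xe9".ljust(8, b"\x00")
bogus = decode_field(bogus_field, "ascii", "null-padded", errors="surrogateescape")
```

The last call is the sort of "python trick" I mean: the stray byte survives as a lone surrogate, so the user can inspect it or re-decode it later rather than hitting an exception on read.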

3) User has to deal with fixed-width binary data from an external program/library that really is binary bytes.
-> Solvable with a dtype that returns fixed-length byte strings.

4) User has a stupendous number (billions) of short strings which are mostly but not entirely ASCII and wants to manipulate them as strings.
-> Not sure how to solve this. Maybe an object array with byte strings for storage and encoding information in the dtype, allowing transparent decoding? Or a fixed-storage-size array with a one-byte encoding that can cope with all the characters the user will ever want to use?

5) User has a bunch of mystery-encoding strings(?) and wants to store them in a numpy array.
-> If they're python strings already, no further harm is done by treating this as case 1 when in python-land. If they need to be in fixed-width fields for communication with an external program or library, this puts us in case 2, unknown-encoding variety. The user will have to pick an encoding that the external program is likely to be able to cope with; this may be the one that originated the mystery strings in the first place.

6) User has python strings and wants to store them in non-object numpy arrays for some reason but doesn't care about the actual memory layout.
-> Solvable with the current setup: fixed-width UCS-4 fields, padded with Unicode NULL. Happily, this comes for free from arbitrary-encoding fixed-storage-size dtypes, though a friendlier interface might be nice. It also allows people to use UCS-2 or ASCII if they know their strings fit.
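
For a sense of what the choice of encoding costs in storage, here is a small stdlib-only sketch that packs the same strings into NUL-padded fixed-width fields under three encodings. The helper `fixed_width_field` is hypothetical, and I am using the `utf-16-le` codec as a stand-in for UCS-2, which is only equivalent when every character is in the Basic Multilingual Plane.

```python
def fixed_width_field(text, encoding, n_chars, bytes_per_char):
    """Encode text into a NUL-padded field of n_chars * bytes_per_char bytes."""
    raw = text.encode(encoding)
    width = n_chars * bytes_per_char
    if len(raw) > width:
        raise ValueError("string does not fit in the field")
    return raw.ljust(width, b"\x00")


names = ["alpha", "beta", "gamma"]
n_chars = max(len(s) for s in names)  # field width in characters

# Same data, three storage layouts.
ucs4 = [fixed_width_field(s, "utf-32-le", n_chars, 4) for s in names]
ucs2 = [fixed_width_field(s, "utf-16-le", n_chars, 2) for s in names]
one_byte = [fixed_width_field(s, "ascii", n_chars, 1) for s in names]

ucs4_bytes = sum(len(f) for f in ucs4)
ucs2_bytes = sum(len(f) for f in ucs2)
one_byte_bytes = sum(len(f) for f in one_byte)

# Reading a field back: decode, then strip the NUL padding.
recovered = ucs4[1].decode("utf-32-le").rstrip("\x00")
```

For these strings the three layouts differ by the expected factors of four and two, which is exactly why the memory-constrained cases want a one-byte encoding available.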

7) User has data in one binary format and it needs to go into another, with perhaps casual inspection while in python-land. Such data is mostly ASCII but might contain mystery characters; presenting gobbledygook to the user is okay as long as the characters are output intact.
-> Reading and writing as a fixed-width one-byte encoding, preferably one resembling the one the data is actually in, should work here. UTF-8 is likely to mangle the data; ASCII-with-surrogateescape might do okay. The key thing here is that both input and output files will have their own ways of specifying string length and their own storage specifiers; the user must know these, and someone has to know and specify what to do with strings that don't fit. Simple truncation will mangle UTF-8 data that is not known to be UTF-8, since it can cut a multi-byte character in half, but there may not be much that can be done about that.
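
To illustrate the last two points with plain Python (no numpy involved, and the byte strings are invented examples): ASCII-with-surrogateescape passes mystery bytes through unchanged, while blind truncation of UTF-8 can leave an undecodable fragment.

```python
# Mostly-ASCII data with some mystery bytes of unknown encoding.
original = b"caf\xe9 \xd0\xb6"

# ASCII + surrogateescape: one str character per input byte, and the
# non-ASCII bytes come back out intact when encoding the same way.
text = original.decode("ascii", errors="surrogateescape")
roundtrip = text.encode("ascii", errors="surrogateescape")

# Strict UTF-8 refuses this data; errors="replace" would accept it but
# substitute U+FFFD for the bad byte, so the original bytes would be lost.
try:
    original.decode("utf-8")
    utf8_decodes = True
except UnicodeDecodeError:
    utf8_decodes = False

# Truncating UTF-8 at a byte count can split a multi-byte character.
utf8_bytes = "na\u00efve".encode("utf-8")  # the i-diaeresis takes 2 bytes
truncated = utf8_bytes[:3]                 # cuts that character in half
try:
    truncated.decode("utf-8")
    truncated_decodes = True
except UnicodeDecodeError:
    truncated_decodes = False
```

One caveat: a str holding lone surrogates cannot be printed to a strict UTF-8 terminal, so "presenting gobbledygook" would in practice go through something like `ascii(text)` or a re-encode with the same error handler.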

I guess my point is that a use case should specify:
* Where does the data come from (i.e. in what format)?
* Are there memory constraints in the storage format?
* How should access look to the user? In particular, what should misencoded data look like?
* Where does the data need to go?