Now my proposal for the other use cases:

2) There should be some way to store mostly ascii-compatible strings in a single-byte-per-character array -- so as not to waste space on "typical european-language-oriented data". Note: this should ALSO be compatible with Python's character-oriented string model, i.e. a Python string of length N will fit into a dtype of size N.

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

Attempting to put in a string that is not compatible with the encoding would raise an EncodingError.

This is also a use case primarily for "casual" users -- but ones who are concerned with the size of the data storage and know that they are using european text.

More detail elsewhere -- but either ascii with surrogateescape or latin-1 would be a good option here. I prefer latin-1 (I really see no downside), but others disagree...
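To make the two options concrete, here is a small sketch using only Python's built-in codecs (no numpy). The helper name fits_single_byte is made up for illustration; it is not a proposed API. It only shows that both options round-trip arbitrary bytes, and that they differ in what the decoded text looks like and in which strings they accept.

```python
# Every possible byte value, 0..255.
all_bytes = bytes(range(256))

# latin-1 maps each byte to exactly one code point, so N bytes <-> N characters.
as_latin1 = all_bytes.decode("latin-1")
assert len(as_latin1) == len(all_bytes)
assert as_latin1.encode("latin-1") == all_bytes

# ascii + surrogateescape also round-trips arbitrary bytes, but bytes >= 0x80
# come back as lone surrogates rather than as readable characters.
as_ascii = all_bytes.decode("ascii", errors="surrogateescape")
assert as_ascii.encode("ascii", errors="surrogateescape") == all_bytes


def fits_single_byte(text, encoding="latin-1"):
    """Hypothetical helper: True if `text` can be stored at one byte per character."""
    try:
        return len(text.encode(encoding)) == len(text)
    except UnicodeEncodeError:
        return False
```

With latin-1, "café" is accepted and takes four bytes; with plain ascii it is rejected, and a character outside latin-1 (such as the euro sign) is rejected under both.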

But then we get to:
 
3) dtypes that support storage in particular encodings:

We need utf-8. We may need others. We may need a 1-byte-per-char compact encoding that isn't close enough to ascii or latin-1 to be useful (say, koi8-r), and I don't think we are going to come to a consensus on what "single" encoding to use for 1-byte-per-char.

So really -- going back to Julian's earlier proposal:

dtype with an encoding specified
"size" in bytes

once defined, numpy would encode/decode to/from python strings "correctly"

we might need "null-terminated utf-8" as a special case.

That would support all the other use cases.

Even the one-byte-per-char encoding. I'd like to see a clean alias to a latin-1 encoding, but not a big deal.

That leaves a couple decisions: 

 - error out or truncate if the passed-in string is too long?

 - error out or surrogateescape if there are invalid bytes in the data?

 - error out or something else if there are characters that can't be encoded in the specified encoding?
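Here is a rough sketch of what those three decisions look like in plain Python, for a field whose size is given in bytes. The functions encode_fixed and decode_fixed and the too_long parameter are hypothetical names for illustration only, not an existing or proposed numpy API, and null padding is assumed as the way to fill unused bytes.

```python
def encode_fixed(text, size, encoding="utf-8", too_long="raise"):
    """Encode `text` into exactly `size` bytes, padded with nulls (assumed)."""
    # Characters the encoding cannot represent raise UnicodeEncodeError here.
    raw = text.encode(encoding)
    if len(raw) > size:
        if too_long == "raise":
            raise ValueError(f"{len(raw)} encoded bytes do not fit in {size}")
        # Truncate, then drop any incomplete trailing multi-byte character
        # so the stored bytes are still valid in the encoding.
        raw = raw[:size].decode(encoding, errors="ignore").encode(encoding)
    return raw.ljust(size, b"\x00")


def decode_fixed(raw, encoding="utf-8", errors="strict"):
    """Decode a null-padded field; `errors` picks strict vs. surrogateescape."""
    return raw.rstrip(b"\x00").decode(encoding, errors=errors)
```

Note that with utf-8 a naive byte truncation can cut a character in half: "naïve" is six bytes, and cutting at three bytes leaves a dangling lead byte, which is why the sketch re-decodes after slicing. Invalid bytes in existing data either raise UnicodeDecodeError (errors="strict") or survive as lone surrogates (errors="surrogateescape").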

And we still need a proper bytes type:

4) a fixed-length bytes dtype -- pretty much what 'S' is now under python 3 -- settable from a bytes or bytearray object (or other memoryview?), and returns a bytes object.

You could use astype() to convert between bytes and a specified encoding with no change in binary representation. This could be used to store any binary data, including encoded text or anything else. This should map directly to the Python bytes model -- thus NOT null-terminated.

This is a little different from the 'S' behaviour on py3 -- it appears that with 'S', if ALL the trailing bytes are null, then they are stripped, but if there is a null byte in the middle, then it is preserved. I suspect that this is a legacy of Py2's use of "strings" as both text and binary data. But in py3, a "bytes" type should be about bytes, and not text, and thus a null-valued byte is simply another value a byte can hold.
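The 'S' behaviour described above can be checked with a few lines. This is just an illustration of the current dtype as I understand it, assuming numpy is installed; the variable names are arbitrary.

```python
import numpy as np

# Two 4-byte items: one with trailing nulls, one with an embedded null.
arr = np.array([b"ab\x00\x00", b"a\x00b\x00"], dtype="S4")

trailing = bytes(arr[0])   # trailing nulls are stripped on item access
embedded = bytes(arr[1])   # the null in the middle is preserved

# The underlying buffer still holds all 4 bytes of each item.
raw = arr.tobytes()
```

So the stored data is intact, but item access cannot tell b"ab" apart from b"ab\x00\x00", which is the part that does not match the Python bytes model.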

 
--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov