On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.kern@gmail.com> wrote:
Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating.

Sorry about that -- I was trying to keep an already really long thread from getting even longer....

And I'm not sure it matters who's doing the advocating, but rather *what* is being advocated -- I hope I didn't screw that up too badly.

Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear.

So I'll try again -- use-case only! We'll keep the possible solutions separate.

Do we need to write up a NEP for this? It seems we are going a bit in circles, and we really do want to capture the final decision process.

1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode-supporting, and with a character-oriented interface. That is, if you do::

  arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or fewer characters,

and arr[1] will return a native Python3 string object.

This is the use-case for "casual" numpy users -- not the folks writing H5py and the like, or the ones writing Cython bindings to C++ libs.
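
For reference, here is a quick check of what the existing 'U' dtype already does on Python 3. It is a sketch of current behaviour only, not of any proposed change, and the variable names are just for illustration:

```python
import numpy as np

arr = np.array(("this", "that"))

# NumPy infers a fixed-width unicode dtype from the longest input string.
kind = arr.dtype.kind            # 'U'
itemsize = arr.dtype.itemsize    # bytes per element, at 4 bytes per character
chars = itemsize // 4            # number of characters each element can hold

# Any unicode string of up to `chars` characters fits, not just ASCII.
arr[0] = "\u00fc\u4e2d\u6587\u20ac"

# Indexing returns np.str_, which is a subclass of the native str.
item = arr[1]
is_native_str = isinstance(item, str)

# Longer strings are silently truncated to the dtype's width.
arr[1] = "those"
truncated = str(arr[1])
```

Note the cost that motivates the next use case: each four-character element occupies 16 bytes.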

 
2) There be some way to store mostly ascii-compatible strings in a single-byte-per-character array -- so as not to waste space for "typical european-language-oriented data". Note: this should ALSO be compatible with Python's character-oriented string model, i.e. a Python string with length N will fit into a dtype of size N::

  arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

Attempting to put in a string that is not compatible with the encoding would raise an EncodingError.

This is also a use-case primarily for "casual" users -- but ones who are concerned with the size of the data storage and know that they are using european text.
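
To make the intent concrete, here is a rough emulation on top of the existing 'S' dtype. It assumes Latin-1 as the single-byte encoding, and `to_single_byte` / `from_single_byte` are hypothetical helpers, not numpy API. Python has no built-in `EncodingError`, so the existing `UnicodeEncodeError` stands in for it here:

```python
import numpy as np

ENCODING = "latin-1"  # assumption: one possible single-byte encoding

def to_single_byte(strings):
    """Hypothetical helper: pack str objects into a one-byte-per-character array."""
    # str.encode raises UnicodeEncodeError for characters outside the encoding.
    encoded = [s.encode(ENCODING) for s in strings]
    width = max(len(b) for b in encoded)
    return np.array(encoded, dtype="S%d" % width)

def from_single_byte(arr, index):
    """Hypothetical helper: return element `index` as a native Python str."""
    return arr[index].decode(ENCODING)

arr = to_single_byte(("this", "caf\u00e9"))
itemsize = arr.dtype.itemsize          # one byte per character
second = from_single_byte(arr, 1)      # round-trips the accented character

try:
    to_single_byte(("\u4e2d\u6587",))  # not representable in Latin-1
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

The point of the use case is that the encode/decode steps above would happen inside the dtype, so the user only ever sees Python strings.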

3) dtypes that support storage in particular encodings:

   Python strings would be encoded appropriately when put into the array. A Python string would be returned when indexing.

   a) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) at the binary level.

   b) There be a dtype that could store data in any encoding supported by Python -- to facilitate bytes-level interchange with other systems. If we need more than utf-8, then we might as well have the full set.
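
One wrinkle with (a) is that a fixed-size utf-8 field is measured in bytes, not characters, so a Python string of length N does not always fit in N bytes. A minimal pure-Python sketch of that, with hypothetical helper names `pack_field` and `unpack_field` and an assumed policy of raising `ValueError` on overflow:

```python
def pack_field(text, nbytes):
    """Hypothetical helper: encode `text` as utf-8 into a null-padded field."""
    data = text.encode("utf-8")
    # Reserve one byte for the terminating null.
    if len(data) > nbytes - 1:
        raise ValueError("encoded text needs %d bytes plus a null, field has %d"
                         % (len(data), nbytes))
    return data.ljust(nbytes, b"\x00")

def unpack_field(field):
    """Hypothetical helper: decode the bytes before the first null."""
    return field.split(b"\x00", 1)[0].decode("utf-8")

ascii_field = pack_field("that", 5)

accented = "na\u00efve"                          # 5 characters
accented_bytes = len(accented.encode("utf-8"))   # the accented letter takes 2 bytes
accented_field = pack_field(accented, 8)
roundtrip = unpack_field(accented_field)

try:
    pack_field(accented, 6)   # the encoded text leaves no room for the null
    overflowed = False
except ValueError:
    overflowed = True
```

Splitting on the first null is only safe here because utf-8 never produces a zero byte for any character other than U+0000; that would not hold for an arbitrary encoding as in (b).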
 
4) A fixed-length bytes dtype -- pretty much what 'S' is now under Python 3 -- settable from a bytes or bytearray object (or other memoryview?), and returns a bytes object.

You could use astype() to convert between bytes and a specified encoding with no change in binary representation. This could be used to store any binary data, including encoded text or anything else. This should map directly to the Python bytes model -- thus NOT null-terminated.

This is a little different than 'S' behaviour on py3 -- it appears that with 'S', if ALL the trailing bytes are null, then the value is truncated, but if there is a null byte in the middle, then it is preserved. I suspect that this is a legacy from Py2's use of "strings" as both text and binary data. But in py3, a "bytes" type should be about bytes, and not text, and thus a null-valued byte is simply another value a byte can hold.
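
The current 'S' behaviour described above can be checked directly. This is a small sketch of existing numpy behaviour, with variable names chosen only for illustration:

```python
import numpy as np

trailing = np.array([b"ab\x00\x00"], dtype="S4")
embedded = np.array([b"a\x00b\x00"], dtype="S4")

# Trailing nulls are dropped when an element is read back...
trailing_item = trailing[0]     # b'ab'
# ...but a null in the middle is kept (only the trailing one is dropped).
embedded_item = embedded[0]     # b'a\x00b'

# The underlying buffer still holds all four bytes in both cases.
raw_trailing = trailing.tobytes()
raw_embedded = embedded.tobytes()
```

So the stored data is intact, but a bytes value that legitimately ends in a null byte cannot be read back unchanged through indexing, which is the mismatch with the Python bytes model.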

There are multiple ways to address these use cases -- please try to make your comments clear about whether you think the use-case is unimportant, or ill-defined, or if you think a given solution is a poor choice.

To facilitate that, I will put my comments on possible solutions in a separate note, too.

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov