On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern <robert.kern@gmail.com> wrote:
> I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings...

Unless you have hundreds of billions of them.

Which is why a one-byte-per-char encoding is a good idea.

Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)

I agree -- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to do it? Do numpy arrays HAVE to be storing utf-8 natively?
 
or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements.

yeah -- though I've seen projects get stuck in the "sorting out what to do, so nothing gets done" stage before -- I don't want Julian to get too frustrated and end up doing nothing.

So here I'll lay out what I think are the fundamental requirements:

1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully Unicode-supporting, and with a character-oriented interface. I.e. if you do:

arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or fewer characters

and arr[1] will return a native Python string object.
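For reference, here is a small sketch of what the existing 'U' dtype already does for requirement 1. It only uses current NumPy behaviour; the sample strings are made up for illustration.

```python
import numpy as np

# NumPy infers a fixed-width unicode dtype from the longest input string.
arr = np.array(("this", "that"))

print(arr.dtype)      # a 4-character unicode dtype ('<U4' on little-endian machines)
print(arr.itemsize)   # 16: four characters stored as 4 bytes (UCS-4) each

# Any 4-character unicode string fits, not just ASCII.
arr[0] = "日本語!"

# Longer strings are silently truncated to the fixed width.
arr[1] = "thatched"

item = arr[1]
print(repr(item), isinstance(item, str))  # np.str_ is a subclass of the native str
```

Note that the element comes back as np.str_ rather than a plain str, but it is a str subclass, so it behaves as one.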

2) There be some way to store mostly ascii-compatible strings in a single-byte-per-character array -- so as not to waste space for "typical european-oriented data".

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD) 

and arr[1] would return a python string.

attempting to put in a string that is not compatible with the encoding would raise an encoding error.

I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the encoding in this case.
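Since np.single_byte_string does not exist, here is a rough emulation of the behaviour I have in mind, built on the existing 'S' dtype. The helpers to_single_byte and get_item are hypothetical names invented for this sketch, not a proposed API.

```python
import numpy as np

def to_single_byte(strings, encoding="latin-1"):
    """Hypothetical helper: encode str values into a fixed-width bytes array."""
    # str.encode raises UnicodeEncodeError for characters the encoding lacks.
    encoded = [s.encode(encoding) for s in strings]
    return np.array(encoded)  # dtype 'S<n>', one byte per character here

def get_item(arr, index, encoding="latin-1"):
    """Hypothetical helper: return one element as a native Python str."""
    return bytes(arr[index]).decode(encoding)

arr = to_single_byte(("this", "caf\u00e9"))
print(arr.dtype, arr.itemsize)   # 4 bytes per element instead of 16 for 'U4'
print(get_item(arr, 1))

try:
    to_single_byte(("this", "\u65e5\u672c"))
    rejected = False
except UnicodeEncodeError:
    rejected = True
print("non-latin-1 input rejected:", rejected)

# latin-9 (ISO 8859-15) differs from latin-1 in a few code points, e.g. the euro sign.
euro_latin9 = "\u20ac".encode("iso8859-15")
```

A real dtype would do the encode/decode inside getitem/setitem instead of in helper functions, but the visible behaviour would be about the same.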

3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???)

4) a fixed length bytes dtype -- pretty much what 'S' is now under Python 3 -- settable from a bytes or bytearray object, and returns a bytes object.
 - you could use astype() to convert between bytes and a specified encoding with no change in binary representation.
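To make point 4 concrete: the 'S' dtype already round-trips bytes, and the nearest thing to the astype() idea today is np.char.decode / np.char.encode, which build new arrays rather than reinterpreting in place. A minimal sketch, with made-up sample data:

```python
import numpy as np

raw = np.array([b"this", b"that"])       # dtype 'S4'
item = raw[1]
print(isinstance(item, bytes))           # np.bytes_ is a subclass of bytes

# Changing the interpretation currently means a real conversion (a copy):
text = np.char.decode(raw, "latin-1")    # new unicode array
back = np.char.encode(text, "latin-1")   # new bytes array

print(text.dtype.kind, back.dtype.kind)  # 'U' then 'S'
```

The "no change in binary representation" part is what an encoding-aware dtype would add; nothing above provides that.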

2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes.

1) could be covered with the existing 'U' type -- only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features.
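A rough way to put numbers on the space trade-off, as a sketch only. The object-dtype estimate assumes CPython (sys.getsizeof is implementation-specific) and assumes every element is a distinct str object, so treat it as an upper-ish bound rather than a measurement.

```python
import sys
import numpy as np

sizes = {}
for width in (4, 40, 400):
    text = "x" * width
    u_bytes = np.dtype(f"U{width}").itemsize   # 4 bytes per character
    s_bytes = np.dtype(f"S{width}").itemsize   # 1 byte per character
    # object dtype: one pointer per element plus the str object itself
    obj_bytes = np.dtype(object).itemsize + sys.getsizeof(text)
    sizes[width] = (u_bytes, s_bytes, obj_bytes)
    print(width, u_bytes, s_bytes, obj_bytes)
```

For ASCII-only text the per-object overhead dominates at short widths and matters less as the strings get longer, which is the "less for long-ish strings" point.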

> +1.  The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255.  Getting a decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective.

That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding.

Well, yes -- BUT:  That strictness in python3 -- "data is either text or bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- bit Python3 in the butt for a long time. Folks that deal in the messy real world of binary data that is kinda-mostly text, but may have a bit of binary data, or be in an unknown encoding, or be corrupted, were very, very adamant about how this model DID NOT work for them. Very influential people were seriously critical of python 3. Eventually, py3 added bytes string formatting, surrogateescape, and other features that facilitate working with messy almost-text.
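For anyone who has not used those features, a short sketch of the two mentioned above, using only the standard library. The sample bytes are invented for illustration.

```python
raw = b"temp=\xb0C \xff\xfe"   # mostly text, but not valid UTF-8

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict UTF-8 decode succeeded:", strict_ok)

# surrogateescape smuggles undecodable bytes through as lone surrogates,
# so the original bytes can be recovered exactly on the way back out.
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# bytes %-formatting (Python 3.5+) works without any decoding step.
header = b"%s: %d" % (b"count", 3)
print(header)
```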

Practicality beats purity -- if you have one-byte-per-char data that is mostly european, then latin-1 or latin-9 let you work with it, have it mostly work, and never crash out with an encoding error.
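The "never crash" property is easy to check for latin-1 with Python's own codecs. This sketch decodes every possible byte value and re-encodes it, then shows where ASCII and UTF-8 give up on the same input.

```python
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding cannot fail and re-encoding restores the input.
text = all_bytes.decode("latin-1")
round_trip = text.encode("latin-1")
print(len(text), round_trip == all_bytes)

# ASCII and UTF-8 both reject some byte values.
failures = {}
for encoding in ("ascii", "utf-8"):
    try:
        all_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        failures[encoding] = exc.start   # index of the first offending byte
print(failures)
```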

> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me.

latin-1 would be only for the special case of mostly-ascii (or true latin) one-byte-per-char encodings (which is a common use-case in scientific data sets). I think it has only upside over ascii. It would be a fine idea to support any one-byte-per-char encoding, too.

As for external data in utf-8 -- yes, that should be dealt with properly -- either by truly supporting utf-8 internally, or by properly encoding/decoding when putting it in and moving it out of an array.

utf-8 is a very important encoding -- I just think it's the wrong one for the default interplay with python strings.
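Here is a sketch of the "encode on the way in, decode on the way out" option, which also shows why a fixed width in bytes is awkward with utf-8: character counts and byte counts diverge. The helpers encode_fixed and decode_fixed are hypothetical names for illustration only.

```python
import numpy as np

def encode_fixed(strings, nbytes, encoding="utf-8"):
    """Hypothetical helper: encode to a fixed byte width, refusing overflow."""
    encoded = [s.encode(encoding) for s in strings]
    for s, b in zip(strings, encoded):
        if len(b) > nbytes:
            raise ValueError(f"{s!r} needs {len(b)} bytes, only {nbytes} available")
    return np.array(encoded, dtype=f"S{nbytes}")

def decode_fixed(arr, encoding="utf-8"):
    """Hypothetical helper: decode every element back to a Python str."""
    return [bytes(b).decode(encoding) for b in arr]

words = ("na\u00efve", "\u65e5\u672c\u8a9e")
char_lengths = [len(w) for w in words]                  # counts characters
byte_lengths = [len(w.encode("utf-8")) for w in words]  # counts UTF-8 bytes
print(char_lengths, byte_lengths)

arr = encode_fixed(words, nbytes=max(byte_lengths))
decoded = decode_fixed(arr)
print(decoded == list(words))

# Cutting at an arbitrary byte boundary can split a multi-byte character.
try:
    words[1].encode("utf-8")[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print("truncated bytes still decode:", split_ok)
```

So with a bytes-specified width, how many characters fit depends on the content, which is the mismatch with a character-oriented interface.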

 Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1!

of course not -- if you are writing to a format that specifies a width and the encoding, you want to use bytes :-) -- or a dtype that is properly encoding-aware. I was not suggesting that latin-1 be used for arbitrary bytes -- that is what bytes are for.

> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError.
 
But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array?

of course not -- see above. 

I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake option.

well, it wouldn't create mojibake -- anything that went from a python string to a latin-1 array would be properly encoded in latin-1 -- unless it came from already corrupted data. But when you have corrupted data, your only choices are to:

 - raise an error
 - alter the data (errors="replace")
 - pass the corrupted data on through.

but it could deal with mojibake -- that's the whole point :-)
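The three choices listed above map onto Python's standard decoding options roughly like this. The sample is an assumed case of latin-1 bytes that have been mislabeled as UTF-8.

```python
corrupted = b"caf\xe9 \xff"   # latin-1 bytes, wrongly declared to be UTF-8

# 1) raise an error
try:
    corrupted.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# 2) alter the data: bad bytes become U+FFFD and the originals are lost
replaced = corrupted.decode("utf-8", errors="replace")

# 3) pass the data through: latin-1 keeps every byte recoverable
passed = corrupted.decode("latin-1")
recovered = passed.encode("latin-1")

print(raised, repr(replaced), repr(passed), recovered == corrupted)
```

Option 3 may display garbage if the data was not really latin-1, but nothing is destroyed, so whoever knows the real encoding can still fix it later.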
 
It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal).

God no -- sorry if it looked like I was suggesting that. I only suggest that it might be *the* one-byte-per-char string type  

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov