On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
> On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com> wrote:
>> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly.
>> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`?
> This is very simple and obvious, but I will state it for the record.

I appreciate it. What is obvious to you is not obvious to me.

> Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 such an array cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat.
> In [22]: dat = np.array(['yes', 'no'], dtype='S3')
> In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
> Out[23]: False
> In [24]: dat == b'yes'  # Right answer but not practical
> Out[24]: array([ True, False], dtype=bool)

I'm curious why you think this is not practical. It seems like a very practical solution to me.

In Py3 most character data will be `str`, not `bytes`. So every time you want to interact with the bytes array (compare, assign, etc.) you need to explicitly coerce the right-hand-side operand to be a bytes-compatible object. For code that developers write, this might be possible, but it results in ugly code. For the general science and engineering communities that use numpy, it is completely untenable.
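
To make the coercion concrete, here is a minimal sketch of what it looks like when the right-hand side starts out as a Py3 `str`. The names `target` and `mask` are only illustrative, and the ASCII encoding is an assumption about the data, not something the file format guarantees.

```python
import numpy as np

# Bytes array of the kind one gets back from an HDF5 read.
dat = np.array(['yes', 'no'], dtype='S3')

# A value that arrives as an ordinary Py3 str (user input, config, etc.).
target = 'yes'

# The str has to be encoded by hand before the comparison means anything.
# 'ascii' is an assumed encoding for this sketch.
mask = dat == target.encode('ascii')
print(mask)
```

Every comparison against a `str` value needs the same explicit `.encode(...)` step, which is the part that spreads through user code.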

The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface.  That is precisely what we are trying to eliminate.
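
For the record, a rough sketch of that sandwich and its memory cost, assuming the bytes are plain ASCII (the variable names `raw`, `text` and `out` are just for illustration):

```python
import numpy as np

# Bytes as they come in at the I/O boundary.
raw = np.array([b'yes', b'no'], dtype='S3')

# Decode once at the interface; the result is a fixed-width `U` array.
# 'ascii' is an assumed encoding for this sketch.
text = np.char.decode(raw, 'ascii')

# Inside the sandwich, comparison with str literals works directly.
mask = text == 'yes'

# The price: each character now takes 4 bytes instead of 1.
print(raw.nbytes, text.nbytes)

# Encode again on the way back out.
out = np.char.encode(text, 'ascii')
```

The comparison becomes natural, but `text` occupies four times the memory of `raw`.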

- Tom

Robert Kern

NumPy-Discussion mailing list