On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas
<aldcroft@head.cfa.harvard.edu> wrote:

> On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
>
>> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas
>> <aldcroft@head.cfa.harvard.edu> wrote:
>>
>>> On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com>
>>> wrote:
>>>> I am not unfamiliar with this problem. I still work with files that
>>>> have fields that are supposed to be in EBCDIC but actually contain text
>>>> in ASCII, UTF-8 (if I'm lucky), or any of a variety of East European
>>>> 8-bit encodings. In that experience, I have found that just treating
>>>> the data as latin-1 unconditionally is not a pragmatic solution. It's
>>>> really easy to implement, and you do get a program that runs without
>>>> raising an exception (at the I/O boundary, at least), but you don't
>>>> often get a program that really runs correctly or treats the data
>>>> properly.
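
A two-line illustration of that failure mode: because latin-1 can decode any
byte sequence, mis-encoded text sails through without an exception.

    raw = 'café'.encode('utf-8')  # bytes that actually arrived as UTF-8
    text = raw.decode('latin-1')  # never raises; every byte is valid latin-1
    print(text)                   # 'cafÃ©' -- the program "runs", the data is wrong
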
>>>> Can you walk us through the problems that you are having with working
>>>> with these columns as arrays of `bytes`?
>>> This is very simple and obvious, but I will state it for the record.
>> I appreciate it. What is obvious to you is not obvious to me.
>>> Reading an HDF5 file with character data currently gives arrays of
>>> `bytes` [1]. In Py3 these cannot be compared to a string literal, and
>>> comparing to (or assigning from) explicit byte strings everywhere in the
>>> code quickly spins out of control. This generally forces one to convert
>>> the data to `U` type and incur the 4x memory bloat.
>>>
>>> In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>>>
>>> In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
>>> Out[23]: False
>>>
>>> In [24]: dat == b'yes'  # Right answer but not practical
>>> Out[24]: array([ True, False], dtype=bool)
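
For scale, a short sketch of that bloat; the itemsize figures are what numpy
reports for the 'S' and 'U' dtypes:

    import numpy as np

    dat = np.array(['yes', 'no'], dtype='S3')   # 1 byte per character
    udat = dat.astype('U3')                     # 4 bytes per character (UCS-4)

    print(dat.itemsize, udat.itemsize)          # 3 12
    print(dat.nbytes, udat.nbytes)              # 6 24 -- the 4x bloat
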
>> I'm curious why you think this is not practical. It seems like a very
>> practical solution to me.
> In Py3 most character data will be `str`, not `bytes`. So every time you
> want to interact with the bytes array (compare, assign, etc.) you need to
> explicitly coerce the right-hand-side operand to a bytes-compatible object.
> For code that developers write, this might be possible, but it results in
> ugly code. But for the general science and engineering communities that
> use numpy, this is completely untenable.

Okay, so the problem isn't with (byte-)string literals, but with variables
being passed around from other sources. E.g.:

    def func(dat, scalar):
        return dat == scalar

Every one of those functions deepens the abstraction and moves that
unicode-by-default scalar farther away from the bytes-ish array, so it's
harder to demand that users of those functions be aware that they need to
pass in `bytes` strings. So you need to implement those functions
defensively, which complicates them.
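
To make "defensively" concrete, a minimal sketch of where that leads; the
str-to-bytes coercion and its ASCII assumption are illustrative, not a
recommendation:

    import numpy as np

    def func(dat, scalar):
        # Coerce a str scalar to bytes so the comparison against an 'S'
        # array doesn't silently return False (assumes ASCII-clean data).
        if isinstance(scalar, str):
            scalar = scalar.encode('ascii')
        return dat == scalar

    dat = np.array([b'yes', b'no'], dtype='S3')
    print(func(dat, 'yes'))   # [ True False]
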
> The only practical solution so far is to implement a unicode sandwich and
> convert to the 4-byte `U` type at the interface. That is precisely what we
> are trying to eliminate.
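
A minimal sketch of such a sandwich, assuming ASCII-clean data arriving as
bytes (e.g. from h5py):

    import numpy as np

    raw = np.array([b'yes', b'no'], dtype='S3')  # bytes at the I/O boundary

    # Read side of the sandwich: decode to 'U' once, at the interface.
    dat = np.char.decode(raw, 'ascii')           # dtype='<U3', 4x the memory

    # ... everything in the middle works with native str ...
    mask = (dat == 'yes')

    # Write side: encode back to bytes just before output.
    out = np.char.encode(dat, 'ascii')           # dtype='S3' again
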
What do you think about my ASCII-surrogateescape proposal? Do you think that
would work with your use cases?

In general, I don't think unicode sandwiches will be eliminated by this or by
the latin-1 dtype; the sandwich is usually the right thing to do, and the
surrogateescape the wrong thing. But I'm keenly aware of the problems you get
when there just isn't a reliable encoding to use.

--
Robert Kern
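
For reference, the surrogateescape error handler named in the proposal is the
one already built into Python's codec machinery; a rough sketch of its
round-trip behavior:

    raw = b'caf\xe9'  # bytes in some unknown 8-bit encoding

    # ascii + surrogateescape smuggles non-ASCII bytes through as lone
    # surrogates instead of guessing at an encoding...
    text = raw.decode('ascii', errors='surrogateescape')
    # text == 'caf\udce9'

    # ...and encodes back to the original bytes exactly.
    assert text.encode('ascii', errors='surrogateescape') == raw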