On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas
<aldcroft@head.cfa.harvard.edu> wrote:

> On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
>
>> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas
>> <aldcroft@head.cfa.harvard.edu> wrote:
>>
>>> On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com>
>>> wrote:
>>>> I am not unfamiliar with this problem. I still work with files that
>>>> have fields that are supposed to be in EBCDIC but actually contain text
>>>> in ASCII, UTF-8 (if I'm lucky), or any of a variety of East European
>>>> 8-bit encodings. In that experience, I have found that just treating
>>>> the data as latin-1 unconditionally is not a pragmatic solution. It's
>>>> really easy to implement, and you do get a program that runs without
>>>> raising an exception (at the I/O boundary, at least), but you don't
>>>> often get a program that really runs correctly or treats the data
>>>> properly.
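
A two-line illustration of that failure mode: because latin-1 can decode any
byte sequence, mis-encoded text sails through without an exception.

    raw = 'café'.encode('utf-8')  # bytes that actually arrived as UTF-8
    text = raw.decode('latin-1')  # never raises; every byte is valid latin-1
    print(text)                   # 'cafÃ©' -- the program "runs", the data is wrong
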
>>>> Can you walk us through the problems that you are having with working
>>>> with these columns as arrays of `bytes`?
>>> This is very simple and obvious, but I will state it for the record.
>> I appreciate it. What is obvious to you is not obvious to me.
>>> Reading an HDF5 file with character data currently gives arrays of
>>> `bytes` [1]. In Py3 these cannot be compared to a string literal, and
>>> comparing to (or assigning from) explicit byte strings everywhere in the
>>> code quickly spins out of control. This generally forces one to convert
>>> the data to `U` type and incur the 4x memory bloat.
>>>
>>> In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>>>
>>> In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
>>> Out[23]: False
>>>
>>> In [24]: dat == b'yes'  # Right answer but not practical
>>> Out[24]: array([ True, False], dtype=bool)
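
For scale, a short sketch of that bloat; the itemsize figures are what numpy
reports for the 'S' and 'U' dtypes:

    import numpy as np

    dat = np.array(['yes', 'no'], dtype='S3')   # 1 byte per character
    udat = dat.astype('U3')                     # 4 bytes per character (UCS-4)

    print(dat.itemsize, udat.itemsize)          # 3 12
    print(dat.nbytes, udat.nbytes)              # 6 24 -- the 4x bloat
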
>> I'm curious why you think this is not practical. It seems like a very
>> practical solution to me.
> In Py3 most character data will be `str`, not `bytes`. So every time you
> want to interact with the bytes array (compare, assign, etc.) you need to
> explicitly coerce the right-hand-side operand to a bytes-compatible object.
> For code that developers write, this might be possible, but it results in
> ugly code. But for the general science and engineering communities that
> use numpy, this is completely untenable.

Okay, so the problem isn't with (byte-)string literals, but with variables
being passed around from other sources. E.g.:

    def func(dat, scalar):
        return dat == scalar

Every one of those functions deepens the abstraction and moves that
unicode-by-default scalar farther away from the bytes-ish array, so it's
harder to demand that users of those functions be aware that they need to
pass in `bytes` strings. So you need to implement those functions
defensively, which complicates them.
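
To make "defensively" concrete, a minimal sketch of where that leads; the
str-to-bytes coercion and its ASCII assumption are illustrative, not a
recommendation:

    import numpy as np

    def func(dat, scalar):
        # Coerce a str scalar to bytes so the comparison against an 'S'
        # array doesn't silently return False (assumes ASCII-clean data).
        if isinstance(scalar, str):
            scalar = scalar.encode('ascii')
        return dat == scalar

    dat = np.array([b'yes', b'no'], dtype='S3')
    print(func(dat, 'yes'))   # [ True False]
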
> The only practical solution so far is to implement a unicode sandwich and
> convert to the 4-byte `U` type at the interface. That is precisely what we
> are trying to eliminate.
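
A minimal sketch of such a sandwich, assuming ASCII-clean data arriving as
bytes (e.g. from h5py):

    import numpy as np

    raw = np.array([b'yes', b'no'], dtype='S3')  # bytes at the I/O boundary

    # Read side of the sandwich: decode to 'U' once, at the interface.
    dat = np.char.decode(raw, 'ascii')           # dtype='<U3', 4x the memory

    # ... everything in the middle works with native str ...
    mask = (dat == 'yes')

    # Write side: encode back to bytes just before output.
    out = np.char.encode(dat, 'ascii')           # dtype='S3' again
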
What do you think about my ASCII-surrogateescape proposal? Do you think that
would work with your use cases?

In general, I don't think unicode sandwiches will be eliminated by this or by
the latin-1 dtype; the sandwich is usually the right thing to do, and the
surrogateescape the wrong thing. But I'm keenly aware of the problems you get
when there just isn't a reliable encoding to use.

--
Robert Kern
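
For reference, the surrogateescape error handler named in the proposal is the
one already built into Python's codec machinery; a rough sketch of its
round-trip behavior:

    raw = b'caf\xe9'  # bytes in some unknown 8-bit encoding

    # ascii + surrogateescape smuggles non-ASCII bytes through as lone
    # surrogates instead of guessing at an encoding...
    text = raw.decode('ascii', errors='surrogateescape')
    # text == 'caf\udce9'

    # ...and encodes back to the original bytes exactly.
    assert text.encode('ascii', errors='surrogateescape') == raw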