Re: [Numpy-discussion] proposal: smaller representation of string arrays

April 25, 2017

On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
...
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
aldcroft@head.cfa.harvard.edu> wrote:
...
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern@gmail.com>
wrote:
...
...
I am not unfamiliar with this problem. I still work with files that
have fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.
...
Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
This is very simple and obvious, but I will state it for the record.
I appreciate it. What is obvious to you is not obvious to me.
...
Reading an HDF5 file with character data currently gives arrays of
`bytes` [1].  In Py3 this cannot be compared to a string literal, and
comparing to (or assigning from) explicit byte strings everywhere in the
code quickly spins out of control.  This generally forces one to convert
the data to `U` type and incur the 4x memory bloat.

In [22]: dat = np.array(['yes', 'no'], dtype='S3')
In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
Out[23]: False
In [24]: dat == b'yes'  # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)

I'm curious why you think this is not practical. It seems like a very
practical solution to me.
In Py3 most character data will be `str`, not `bytes`.  So every time you
want to interact with the bytes array (compare, assign, etc.) you need to
explicitly coerce the right-hand operand to a bytes-compatible object.
For code that developers write, this is possible but results in ugly code.
For the general science and engineering communities that use numpy, it is
completely untenable.
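
To make the coercion burden concrete, here is a minimal sketch (not part of the original message; the names `target` and `mask` are illustrative, and ASCII is an assumed encoding) of what a comparison or assignment against an `S` array looks like when the other operand starts out as a Py3 `str`:

```python
import numpy as np

# Bytes column, as in the In [22] example above.
dat = np.array(['yes', 'no'], dtype='S3')

# The value typically arrives as a Py3 str ...
target = 'yes'

# ... so it has to be encoded explicitly before comparing.
mask = dat == target.encode('ascii')

# The same explicit encode step when assigning a new value.
dat[1] = 'yes'.encode('ascii')
```

Each call site needs its own `.encode(...)`, and the encoding has to be chosen by hand every time.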

The only practical solution so far is to implement a unicode sandwich and
convert to the 4-byte `U` type at the interface.  That is precisely what we
are trying to eliminate.
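
A rough sketch of that sandwich, assuming ASCII data and using `np.char.decode` / `np.char.encode` at the boundaries (the names `raw`, `text`, and `out` are made up for illustration, and a real HDF5 read is replaced by a literal array):

```python
import numpy as np

# Input boundary: character data arrives as bytes (dtype S3).
raw = np.array([b'yes', b'no'], dtype='S3')

# Decode once on the way in; the result is a 4-byte-per-character U array.
text = np.char.decode(raw, 'ascii')

# Inside the sandwich, plain str literals work as expected.
mask = text == 'yes'

# The cost: each element grows from 3 bytes to 12 bytes.
bytes_per_item = raw.itemsize      # 3
unicode_per_item = text.itemsize   # 12

# Output boundary: encode back to bytes before writing.
out = np.char.encode(text, 'ascii')
```

The comparisons become natural, but the in-memory array is four times the size of the data that was read.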

- Tom
...
--
Robert Kern