[Numpy-discussion] proposal: smaller representation of string arrays

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Mon Apr 24 19:06:56 EDT 2017


On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.kern at gmail.com> wrote:

> I am not unfamiliar with this problem. I still work with files that have
> fields that are supposed to be in EBCDIC but actually contain text in
> ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
> encodings. In that experience, I have found that just treating the data as
> latin-1 unconditionally is not a pragmatic solution. It's really easy to
> implement, and you do get a program that runs without raising an exception
> (at the I/O boundary at least), but you don't often get a program that
> really runs correctly or treats the data properly.
>

> Can you walk us through the problems that you are having with working with
> these columns as arrays of `bytes`?
>

This is very simple and obvious, but I will state it for the record.  Reading
an HDF5 file with character data currently gives arrays of `bytes` [1].  In
Py3 these cannot be meaningfully compared to string literals, and comparing
to (or assigning from) explicit byte strings everywhere in the code quickly
spins out of control.  This generally forces one to convert the data to `U`
type and incur the 4x memory bloat.

In [22]: dat = np.array(['yes', 'no'], dtype='S3')

In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'  # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)
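For concreteness, the 4x figure falls straight out of the itemsizes.
Continuing the session above (the In/Out numbering here is illustrative):

In [25]: dat.nbytes  # 2 elements * 3 bytes each ('S3')
Out[25]: 6

In [26]: dat.astype('U3').nbytes  # 2 elements * 3 chars * 4 bytes each (UCS-4)
Out[26]: 24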

- Tom

[1]: Using h5py or PyTables.  Same with FITS, although astropy.io.fits does
some tricks under the hood to auto-convert to `U` type as needed.
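For anyone in the same situation, here is a minimal sketch of the kind of
helper one ends up writing to localize the bytes->str conversion at the I/O
boundary (the function name is mine, not an existing API; it just wraps
np.char.decode):

import numpy as np

def decode_bytes_array(arr, encoding='ascii'):
    # Decode an 'S'-dtype array to 'U' dtype, paying the 4x memory cost.
    # np.char.decode applies bytes.decode element-wise.
    return np.char.decode(arr, encoding)

With that, decode_bytes_array(dat) == 'yes' gives the expected boolean
array, at the price of making a full converted copy of the data.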


>
>
> > So I would beg to actually move forward with a pragmatic solution that
> addresses very real and consequential problems that we face instead of
> waiting/praying for a perfect solution.
>
> Well, I outlined a solution: work with `bytes` arrays with utilities to
> convert to/from the Unicode-aware string dtypes (or `object`).
>
> A UTF-8-specific dtype and maybe a string-specialized `object` dtype
> address the very real and consequential problems that I face (namely and
> respectively, working with HDF5 and in-memory manipulation of string
> datasets).
>
> I'm happy to consider a latin-1-specific dtype as a second,
> workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake
> option. It should not be *the* Unicode
> string dtype (i.e. named np.realstring or np.unicode as in the original
> proposal).
>
> --
> Robert Kern