[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Mon Apr 24 15:09:06 EDT 2017


On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker at noaa.gov>
wrote:
>
> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
>>> In this case, we want something compatible with Python's string (i.e.
fully Unicode-supporting), and I think it should be as transparent as
possible. Python's string has made the decision to present a
character-oriented API to users (despite what the manifesto says...).
>>
>>
>> Yes, but NumPy doesn't really implement string operations, so
fortunately this is pretty irrelevant to us -- except for our API for
specifying dtype size.
>
> Exactly -- the character-orientation of Python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be cognitive dissonance if
someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
> String too long for a string[12] dtype array.

We have the freedom to make the error message not suck. :-)
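
For instance (hypothetical wording, just to illustrate), the message could
report both counts:

    ValueError: string of 15 characters (18 bytes when encoded as UTF-8)
    does not fit in a dtype with an encoded width of 12 bytes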

> When len(a_string) <= 12
>
> AND that will only occur if there are non-ASCII characters in the
string, and maybe only if there are more than N non-ASCII characters,
i.e. it is very likely to be a run-time error that may not have shown up
in tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not,
they need to truncate it, and THAT is non-obvious how to do, too -- you
don't want to truncate the encoded bytes naively, since you could end up
with an invalid bytestring. But you don't know how many characters to
truncate, either.

If this becomes the right strategy for dealing with these problems (and I'm
not sure that it is), we can easily make a utility function that does this
for people.
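
A minimal sketch of such a helper (the name and the pure-Python approach
are just illustrative):

    def truncate_to_utf8_width(s, width):
        # Encode, cut at the byte limit, then drop any bytes at the end
        # that form an incomplete UTF-8 sequence before decoding back.
        encoded = s.encode('utf-8')[:width]
        return encoded.decode('utf-8', errors='ignore')

E.g. truncate_to_utf8_width('résumé', 2) returns 'r': the two-byte 'é'
would be split by the slice, so it is dropped rather than producing an
invalid bytestring.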

This discussion is why I want to be sure that we have our use cases
actually mapped out. For this kind of in-memory manipulation, I'd use an
object array (a la pandas), then convert to the uniform-width string dtype
when I needed to push this out to a C API, HDF5 file, or whatever actually
requires a string-dtype array. The required width gets computed from the
data after all of the manipulations are done. Doing in-memory assignments
to a fixed-encoding, fixed-width string dtype will always have this kind of
problem. You should only put up with it if you have a requirement to write
to a format that specifies the width and the encoding. That specified
encoding is frequently not latin-1!
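
Concretely, a sketch of that workflow (assuming UTF-8 is the target
encoding; the details depend on the format):

    import numpy as np

    # Manipulate freely as Python strings in an object array...
    strings = np.array(['foo', 'naïve', 'Δt'], dtype=object)

    # ...then compute the required width from the encoded data and
    # convert only when pushing out to the C API / HDF5 file / etc.
    encoded = [s.encode('utf-8') for s in strings]
    width = max(len(b) for b in encoded)
    fixed = np.array(encoded, dtype='S%d' % width)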

>> I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte-per-char encoding. IF you want to assure that
your data are one byte per char, then you could use ASCII, and it would
be binary-compatible with utf-8, but I'm not sure what the point of that
is in this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure, it only covers the Latin
languages, but it does cover those much better.
>
> - A handful of other characters, including scientifically useful ones
(a few Greek characters, the degree symbol, etc.).
>
> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get a UnicodeDecodeError.

But what if the format I'm working with specifies another encoding? Am I
supposed to encode all of my Unicode strings in the specified encoding,
then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
really important use case for me.
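
To spell out that workaround in plain Python (with a hypothetical
HDF5-bound string):

    s = 'Δt = 0.5'                       # what I actually have
    payload = s.encode('utf-8')          # what the format expects
    # Smuggle the UTF-8 bytes through a latin-1 str so they can be
    # assigned into a latin-1 array unchanged.
    workaround = payload.decode('latin-1')
    assert workaround.encode('latin-1') == payload

That round-trips, but the in-memory "string" is mojibake, which rather
defeats the purpose of a transparent string dtype.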

--
Robert Kern