[Numpy-discussion] proposal: smaller representation of string arrays

Stephan Hoyer shoyer at gmail.com
Thu Apr 20 14:16:34 EDT 2017


On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker <chris.barker at noaa.gov>
wrote:

> On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
>> I agree with Anne here. Variable-length encoding would be great to have,
>> but even fixed length UTF-8 (in terms of memory usage, not characters)
>> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
>> fixed size per array element, but that doesn't mean we need a fixed size
>> per character. Each element in a UTF-8 array would be a string with a fixed
>> number of codepoints, not characters.
>>
>
> Ah, yes -- the nightmare of Unicode!
>
> No, it would not be a fixed number of codepoints -- it would be a fixed
> number of bytes (or "code units"), and an unknown number of characters.
>

Apologies for confusing the terminology! Yes, this would mean a fixed
number of bytes and an unknown number of characters.
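
To make the three notions concrete, here is a small sketch in plain Python
(no NumPy involved). The variable names are only for illustration, and UTF-8
is assumed as the encoding:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
s = "e\u0301"

code_points = len(s)                 # Python's len() counts code points
code_units = len(s.encode("utf-8"))  # UTF-8 code units are bytes

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", s)
composed_points = len(composed)
composed_units = len(composed.encode("utf-8"))

print(code_points, code_units, composed_points, composed_units)
```

The same visible character occupies a different number of code points and
bytes depending on normalization, which is why a fixed number of bytes is the
only count a fixed-size element can promise.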


> As Julian pointed out, if you wanted to specify that a numpy element would
> be able to hold, say, N characters (actually code points, combining
> characters make this even more confusing) then you would need to allocate
> N*4 bytes to make sure you could hold any string that long. Which would be
> pretty pointless -- better to use UCS-4.
>

It's already unsafe to try to insert arbitrary-length strings into a numpy
string_ or unicode_ array. The difference shows up when determining the dtype
automatically (e.g., with np.array(list_of_strings)): numpy would need to
check the maximum encoded length instead of the character length (i.e.,
len(x.encode()) instead of len(x)).
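
A rough sketch of that size check, in plain Python. The helper names
utf8_itemsize and char_itemsize are hypothetical, not NumPy functions, and
UTF-8 is assumed:

```python
def utf8_itemsize(strings):
    """Bytes needed per element so every string fits once UTF-8 encoded."""
    return max((len(x.encode("utf-8")) for x in strings), default=0)


def char_itemsize(strings):
    """Code points needed per element, i.e. the len(x)-based rule."""
    return max((len(x) for x in strings), default=0)


list_of_strings = ["abc", "na\u00efve", "\u65e5\u672c\u8a9e"]

n_bytes = utf8_itemsize(list_of_strings)
n_chars = char_itemsize(list_of_strings)
print(n_bytes, n_chars)
```

The two rules pick their maximum from different elements: the longest string
by code points is not necessarily the longest one in bytes.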

I certainly would not over-allocate. If users want more space, they can
explicitly choose an appropriate size. (This is a hazard of not having
variable-length dtypes.)

If users really want to be able to fit an arbitrary number of unicode
characters and aren't concerned about memory usage, they can still use
np.unicode_ -- that won't be going away.


> So Anne's suggestion that numpy truncates as needed would make sense --
> you'd specify say N characters, numpy would arbitrarily (or user specified)
> over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a
> string that didn't fit. Then you'd need to make sure you truncated
> correctly, so as not to create an invalid string (that's just code, it
> could be made correct).
>

NumPy already does this sort of silent truncation with longer strings
inserted into shorter string dtypes. The difference here would indeed be the
need to check the number of bytes represented by the string instead of the
number of characters.

But I don't think this is useful behavior to bring over to a new dtype. We
should error instead of silently truncating. This is certainly easier than
trying to figure out when we would be splitting a character.
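
As a sketch of what "error instead of truncating" could look like for a
single element, in plain Python: encode_fixed is a hypothetical helper, not a
proposed NumPy API, and the null padding mirrors what np.string_ does today.

```python
def encode_fixed(s, nbytes):
    """Encode s as UTF-8 into exactly nbytes, null-padded; refuse to truncate."""
    data = s.encode("utf-8")
    if len(data) > nbytes:
        raise ValueError(
            "string needs %d bytes but the element holds only %d"
            % (len(data), nbytes)
        )
    return data.ljust(nbytes, b"\x00")


print(encode_fixed("ab", 4))

# Naive byte truncation can cut a multi-byte sequence in half:
raw = "\u65e5\u672c".encode("utf-8")  # 6 bytes, 3 per code point
try:
    raw[:4].decode("utf-8")
    split_detected = False
except UnicodeDecodeError:
    split_detected = True
print(split_detected)
```

Raising on overflow needs only a length comparison, whereas truncating
correctly requires finding a code point boundary first.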


> But how much to over-allocate? For English text, with an occasional
> scientific symbol, only a little. For, say, Japanese text, you'd need a
> factor of 2 maybe?
>
> Anyway, the idea that "just use utf-8" solves your problems is really
> dangerous. It simply is not the right way to handle text if:
>
> you need fixed-length storage
> you care about compactness
>
>> In fact, we already have this sort of distinction between element size and
>> memory usage: np.string_ uses null padding to store shorter strings in a
>> larger dtype.
>>
>
> sure -- but it is clear to the user that the dtype can hold "up to this
> many" characters.
>

As Yu Feng points out in this GitHub comment, speakers of non-Latin languages
are already aware of the difference between string length and byte length:
https://github.com/numpy/numpy/pull/8942#issuecomment-294409192
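
For a feel of how far the two lengths diverge, here is a plain-Python
comparison. The sizes helper and the sample strings are made up for
illustration; UTF-32 stands in for the 4-bytes-per-code-point layout of
np.unicode_:

```python
def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-32 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-32-le")),  # "-le" avoids the 4-byte BOM
    )


samples = {
    "english": "temperature",
    "japanese": "\u6e29\u5ea6\u30bb\u30f3\u30b5\u30fc",
}

for name, text in samples.items():
    n_points, n_utf8, n_utf32 = sizes(text)
    print(name, n_points, n_utf8, n_utf32)
```

ASCII text costs one byte per code point in UTF-8, while these Japanese
characters cost three each, which is still less than the four that UTF-32
always uses.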

Making an API based on code units instead of code points really seems like
the saner way to handle unicode strings. I agree with this section of the
DyND design docs for its string type, which notes precedent from Julia and
Go:
https://github.com/libdynd/libdynd/blob/master/devdocs/string-design.md#code-unit-api-not-code-point

> I think a 1-byte-per-char latin-* encoded string is a good idea though --
> scientific use tends to be latin only and space constrained.
>

I think scientific users tend to be ASCII only, so UTF-8 would also work
transparently :).
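
A quick plain-Python check of that claim, with made-up sample strings:
ASCII-only text produces identical bytes under ASCII, Latin-1 and UTF-8, and
the encodings only diverge once a non-ASCII character appears.

```python
label = "pressure_hPa"  # ASCII-only, as assumed for typical scientific labels

ascii_bytes = label.encode("ascii")
latin1_bytes = label.encode("latin-1")
utf8_bytes = label.encode("utf-8")
same_bytes = ascii_bytes == latin1_bytes == utf8_bytes

unit = "\u00b5m"  # U+00B5 MICRO SIGN followed by 'm'
latin1_len = len(unit.encode("latin-1"))  # one byte per character
utf8_len = len(unit.encode("utf-8"))      # the micro sign takes two bytes

print(same_bytes, latin1_len, utf8_len)
```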