[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Thu Apr 20 13:43:18 EDT 2017


On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer at gmail.com> wrote:

> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed sized
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of codepoints, not characters.
>

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed
number of bytes (or "code units"), and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would
be able to hold, say, N characters (actually code points -- combining
characters make this even more confusing), then you would need to allocate
N*4 bytes to make sure you could hold any string that long. Which would be
pretty pointless -- better to use UCS-4.
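
A quick illustration of that point (plain Python -- the sample strings are
just ones I picked): UTF-8 uses 1 to 4 bytes per code point, so
guaranteeing room for N code points means reserving 4*N bytes, the same
footprint as UCS-4:

    for ch in ("a", "é", "€", "𝕏"):
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # a 1 byte(s)
    # é 2 byte(s)
    # € 3 byte(s)
    # 𝕏 4 byte(s)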

So Anne's suggestion that numpy truncate as needed would make sense --
you'd specify, say, N characters; numpy would arbitrarily (or as
user-specified) over-allocate, maybe N*1.5 bytes, and truncate if someone
passed in a string that didn't fit. Then you'd need to make sure you
truncated correctly, so as not to create an invalid string (that's just
code, it could be made correct).
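
For what it's worth, a minimal sketch of that truncation step (plain
Python; the function name and signature are just my own illustration, not
any actual numpy API):

    def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
        """Clip to at most max_bytes without splitting a code point.

        Assumes data is valid UTF-8 going in; after slicing, only a
        trailing incomplete sequence can appear, and errors="ignore"
        drops it. Note this keeps the bytes valid UTF-8, but it can
        still separate a base character from a following combining
        character.
        """
        return data[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")

    truncate_utf8("naïve".encode("utf-8"), 3)  # b'na' -- the 2-byte 'ï' didn't fit
    truncate_utf8("naïve".encode("utf-8"), 4)  # b'na\xc3\xaf', i.e. 'naï'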

But how much to over-allocate? For English text with an occasional
scientific symbol, only a little. For, say, Japanese text, you'd need a
factor of 2 or 3, maybe?
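
A rough way to put numbers on that question (the sample strings are just
ones I typed, not from any real dataset):

    samples = {
        "english": "resolution 1024",
        "scientific symbols": "Δλ ≈ 0.5 µm",
        "japanese": "気象観測データ",
    }
    for name, s in samples.items():
        nbytes = len(s.encode("utf-8"))
        print(f"{name}: {nbytes} bytes / {len(s)} chars = {nbytes / len(s):.2f}")
    # english comes out at 1.0 byte/char, the mixed symbol string around
    # 1.5, and the Japanese one at 3.0.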

Anyway, the idea that "just use utf-8" solves your problems is really
dangerous. It simply is not the right way to handle text if:

- you need fixed-length storage
- you care about compactness

> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
>

Sure -- but it is clear to the user that the dtype can hold "up to this
many" characters.


> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.
>

I see it the other way around -- the only reason TO support utf-8 is for
memory mapping with other systems that use it :-)

On the other hand, if we ARE going to support utf-8 -- maybe we should use
it for all Unicode support, rather than messing around with multiple
encoding options.

I think a 1-byte-per-char latin-* encoded string is a good idea, though --
scientific use tends to be latin-only and space-constrained.
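
To make that concrete (a tiny sketch; how numpy would actually expose such
a dtype is a separate question):

    s = "température: 25.3 °C, 5 µm"
    raw = s.encode("latin-1")
    assert len(raw) == len(s)          # exactly 1 byte per character
    assert raw.decode("latin-1") == s  # lossless round trip

    # anything outside latin-1 fails loudly instead of silently mangling:
    try:
        "Δλ = 0.5 nm".encode("latin-1")
    except UnicodeEncodeError as err:
        print(err)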

All that being said, if the truncation code were carefully written, it
would mostly "just work".

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov