[Numpy-discussion] proposal: smaller representation of string arrays

Eric Wieser wieser.eric+numpy at gmail.com
Thu Apr 20 13:58:26 EDT 2017

> if you truncate a utf-8 bytestring, you may get invalid data

Note that in general truncating unicode codepoints is not a safe operation
either, as combining characters are a thing. So I don't think this is a
good argument against UTF8.
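
To make both failure modes concrete, here is a minimal sketch in plain Python. The helper name `truncate_utf8` is hypothetical and not part of numpy; it only shows that cutting UTF-8 at a byte boundary can be made to produce valid data, while cutting at a code point boundary can still change the visible text.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Truncate valid UTF-8 to at most max_bytes without splitting a sequence.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8
    and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (0b10xxxxxx), the cut
    # falls inside a multi-byte sequence, so back up to the start of it.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "naïve".encode("utf-8")      # 'ï' takes two bytes

# Naive byte truncation leaves half of 'ï' behind, which no longer decodes.
try:
    encoded[:3].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# Boundary-aware truncation drops the whole character instead.
safe = truncate_utf8(encoded, 3).decode("utf-8")

# Code point truncation is always "valid", but it can strip a combining mark:
decomposed = "e\u0301"                 # 'e' + COMBINING ACUTE ACCENT
stripped = decomposed[:1]              # just 'e', the accent is silently lost
```

Note that the boundary-aware version still has the combining character problem; it only guarantees the result decodes.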

Also, is silent truncation a thing that we want to allow to happen anyway?
That sounds like something the user ought to be alerted to with an

> if you wanted to specify that a numpy element would be able to hold, say,
> N characters
> ...
> It simply is not the right way to handle text if [...] you need
> fixed-length storage

It seems to me that counting code points is pretty futile in unicode, due
to combining characters. The only two meaningful things to count are:
* Graphemes, as that's what the user sees visually. These can span multiple
  code points
* Bytes of encoded data, as that's the space needed to store them

So I would argue that the approach of fixed-codepoint-length storage is
itself a flawed design, and so should not be used as a constraint on numpy.

Counting graphemes is hard, so that leaves the only sensible option as a
byte count.
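
As a rough illustration of how the three counts diverge, the sketch below uses only the standard library. The function name `text_sizes` is made up for this example, and the grapheme count is a deliberate approximation (it counts code points that are not combining marks), not a real grapheme segmentation.

```python
import unicodedata


def text_sizes(s: str, encoding: str = "utf-8") -> dict:
    """Return three different 'lengths' of the same text (illustrative only)."""
    return {
        # What len() reports on a Python 3 str.
        "code_points": len(s),
        # Approximation: skip combining marks. Real grapheme clusters need
        # the full Unicode segmentation rules, which this does not implement.
        "approx_graphemes": sum(1 for ch in s if unicodedata.combining(ch) == 0),
        # The storage actually needed in the given encoding.
        "bytes": len(s.encode(encoding)),
    }


decomposed = "e\u0301"                                # 'e' + combining accent
composed = unicodedata.normalize("NFC", decomposed)   # single code point U+00E9

sizes_decomposed = text_sizes(decomposed)   # 2 code points, 1 grapheme, 3 bytes
sizes_composed = text_sizes(composed)       # 1 code point,  1 grapheme, 2 bytes
```

Both strings render as the same single character, yet neither the code point count nor the byte count agrees between them.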

I don't foresee variable-length encodings being a problem
implementation-wise - they only become one if numpy were to acquire a
vectorized substring function that is intended to return a view.

I think I'd be in favor of supporting all encodings, and falling back on
python to handle encoding/decoding them.
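
A minimal sketch of what I mean, in plain Python rather than numpy: each element gets a fixed number of bytes, Python's codecs do the encoding and decoding, and a string that does not fit raises instead of being truncated. The names `pack_fixed` and `unpack_fixed` are hypothetical, and the null padding is an assumption that mimics how np.string_ stores shorter strings.

```python
def pack_fixed(strings, itemsize, encoding="utf-8"):
    """Encode each string into a null-padded slot of exactly `itemsize` bytes.

    Hypothetical helper: raises rather than truncating when a string
    does not fit in its slot.
    """
    buf = bytearray()
    for s in strings:
        raw = s.encode(encoding)            # Python does the encoding
        if len(raw) > itemsize:
            raise ValueError(
                f"{s!r} needs {len(raw)} bytes in {encoding}, "
                f"but the element size is {itemsize}"
            )
        buf += raw.ljust(itemsize, b"\x00")  # pad the slot with NUL bytes
    return bytes(buf)


def unpack_fixed(buf, itemsize, encoding="utf-8"):
    """Decode a buffer produced by pack_fixed back into a list of str."""
    return [
        buf[i:i + itemsize].rstrip(b"\x00").decode(encoding)
        for i in range(0, len(buf), itemsize)
    ]


packed = pack_fixed(["héllo", "日本"], 8)    # two 8-byte slots, 6 bytes used each
roundtrip = unpack_fixed(packed, 8)
latin = pack_fixed(["héllo"], 5, encoding="latin-1")  # 1 byte per character
```

One caveat with this sketch: stripping the padding is only safe for encodings that never end a string with a zero byte, so UTF-16 or UTF-32 would need an explicit length or a different padding scheme.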

On Thu, 20 Apr 2017 at 18:44 Chris Barker <chris.barker at noaa.gov> wrote:

> On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
>> I agree with Anne here. Variable-length encoding would be great to have,
>> but even fixed length UTF-8 (in terms of memory usage, not characters)
>> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
>> fixed size per array element, but that doesn't mean we need a fixed sized
>> per character. Each element in a UTF-8 array would be a string with a fixed
>> number of codepoints, not characters.
> Ah, yes -- the nightmare of Unicode!
> No, it would not be a fixed number of codepoints -- it would be a fixed
> number of bytes (or "code units"). and an unknown number of characters.
> As Julian pointed out, if you wanted to specify that a numpy element would
> be able to hold, say, N characters (actually code points, combining
> characters make this even more confusing) then you would need to allocate
> N*4 bytes to make sure you could hold any string that long. Which would be
> pretty pointless -- better to use UCS-4.
> So Anne's suggestion that numpy truncates as needed would make sense --
> you'd specify say N characters, numpy would arbitrarily (or user specified)
> over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a
> string that didn't fit. Then you'd need to make sure you truncated
> correctly, so as not to create an invalid string (that's just code, it
> could be made correct).
> But how much to over allocate? for english text, with an occasional
> scientific symbol, only a little. for, say, Japanese text, you'd need a
> factor 2 maybe?
> Anyway, the idea that "just use utf-8" solves your problems is really
> dangerous. It simply is not the right way to handle text if:
> * you need fixed-length storage
> * you care about compactness
>> In fact, we already have this sort of distinction between element size and
>> memory usage: np.string_ uses null padding to store shorter strings in a
>> larger dtype.
> sure -- but it is clear to the user that the dtype can hold "up to this
> many" characters.
>> The only reason I see for supporting encodings other than UTF-8 is for
>> memory-mapping arrays stored with those encodings, but that seems like a
>> lot of extra trouble for little gain.
> I see it the other way around -- the only reason TO support utf-8 is for
> memory mapping with other systems that use it :-)
> On the other hand,  if we ARE going to support utf-8 -- maybe use it for
> all unicode support, rather than messing around with all the multiple
> encoding options.
> I think a 1-byte-per char latin-* encoded string is a good idea though --
> scientific use tend to be latin only and space constrained.
> All that being said, if the truncation code were carefully written, it
> would mostly "just work"
> -CHB
> --
> Christopher Barker, Ph.D.
> Oceanographer
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
> Chris.Barker at noaa.gov