[Python-Dev] len(chr(i)) = 2?

James Y Knight foom at fuhm.net
Wed Nov 24 07:26:11 CET 2010


On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
> Or you can give user programs memory indices, and enjoy the fun as
> the poor developers do things like "pos += 1" which works fine on
> the ASCII data they have lying around, then wonder why they get
> Unicode errors when they take substrings.


a) You seem to be hung up on implementation details of emacs. But yes, positions should be stored as a byte offset into the UTF-8 string, NOT as the number of codepoints since the beginning of the string. Probably you want the position to be somewhat opaque, so that you actually have to specify whether you want to advance by one byte, one codepoint, or one grapheme.
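To make the byte-offset idea concrete, here is a minimal sketch of advancing such a position by one codepoint. The function name next_codepoint is made up for illustration, and the sketch assumes the buffer holds valid UTF-8 and that pos already sits on a codepoint boundary.

```python
def next_codepoint(buf: bytes, pos: int) -> int:
    """Return the byte offset of the codepoint following the one at pos.

    Hypothetical helper; assumes buf is valid UTF-8 and pos is on a
    codepoint boundary.
    """
    if pos >= len(buf):
        raise IndexError("already at end of buffer")
    pos += 1
    # UTF-8 continuation bytes look like 0b10xxxxxx; skip over them.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
buf = "a\u00e9\u20ac".encode("utf-8")
offsets = []
pos = 0
while pos < len(buf):
    offsets.append(pos)
    pos = next_codepoint(buf, pos)
# offsets now holds the starting byte offset of each codepoint
```

Moving forward by one codepoint costs only a few byte comparisons, so keeping the position as a byte offset does not make codepoint stepping expensive.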

b) Those poor developers are *already* screwed if they're using pos += 1 when pos is a codepoint index and they then take a substring based on that! They will get half a character when the string contains combining characters...
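A short illustration of that failure, using an ordinary Python 3 str indexed by codepoint. The sample string is my own; it spells the accented letter as 'e' followed by U+0301 COMBINING ACUTE ACCENT.

```python
import unicodedata

# "étude" with the first letter written as 'e' + combining acute accent
s = "e\u0301tude"

pos = 0
pos += 1          # "next character", counted in codepoints

head = s[:pos]    # just the bare 'e', accent lost
tail = s[pos:]    # starts with an orphaned combining mark

# A nonzero combining class means tail begins mid-grapheme.
starts_with_combining = unicodedata.combining(tail[0]) != 0
```

No error is raised here; the substring is simply wrong, which is arguably worse than an exception.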

Pretending that "codepoints" are a useful abstraction just lets poor developers get by without doing the correct thing (incrementing to the next grapheme boundary) for a little bit longer. But once you [the language implementor] are providing correct abstractions for grapheme movement, it's just as easy to also provide an abstraction for codepoint movement, and make your low-level implementation of the iterator object a byte offset into a UTF-8 buffer.
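A rough sketch of what such an iterator object might look like. The class Utf8Cursor and its method names are hypothetical, not an existing API. The grapheme step is deliberately simplified to "one base codepoint plus any following combining marks"; real grapheme cluster boundaries are defined by UAX #29 and need more rules than this.

```python
import unicodedata


class Utf8Cursor:
    """Hypothetical opaque position: a byte offset into a UTF-8 buffer."""

    def __init__(self, buf: bytes, offset: int = 0):
        self._buf = buf          # assumed to be valid UTF-8
        self._offset = offset    # assumed to be on a codepoint boundary

    def at_end(self) -> bool:
        return self._offset >= len(self._buf)

    def _codepoint_end(self, start: int) -> int:
        # Skip the lead byte, then any 0b10xxxxxx continuation bytes.
        end = start + 1
        while end < len(self._buf) and (self._buf[end] & 0xC0) == 0x80:
            end += 1
        return end

    def next_codepoint(self) -> str:
        """Consume and return one codepoint."""
        if self.at_end():
            raise IndexError("cursor is at end of buffer")
        end = self._codepoint_end(self._offset)
        ch = self._buf[self._offset:end].decode("utf-8")
        self._offset = end
        return ch

    def next_grapheme(self) -> str:
        """Consume a base codepoint plus trailing combining marks.

        Simplification: this is not the full UAX #29 algorithm.
        """
        cluster = self.next_codepoint()
        while not self.at_end():
            end = self._codepoint_end(self._offset)
            ch = self._buf[self._offset:end].decode("utf-8")
            if unicodedata.combining(ch) == 0:
                break            # next codepoint starts a new cluster
            cluster += ch
            self._offset = end
        return cluster


buf = "e\u0301a".encode("utf-8")   # 'e' + combining acute, then 'a'
cur = Utf8Cursor(buf)
first = cur.next_grapheme()        # 'e' together with its accent
second = cur.next_grapheme()       # 'a'
```

The caller never sees the byte offset and has to say which kind of step is wanted, so there is no "pos += 1" to get wrong.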

James

