[Python-Dev] len(chr(i)) = 2?

Wed Nov 24 01:22:23 CET 2010

On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
> Maybe Python should have used UTF-8 as its internal unicode
> representation. Then people who were foolish enough to assume
> one character per string item would have their programs break
> rather soon under only light unicode testing. :-)

You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to access the Nth codepoint in a string in constant time, so there's not really any point in storing a vector of codepoints.

Instead, provide bidirectional iterators which can traverse the string by byte, codepoint, or by grapheme (that is: the set of combining characters + base character that go together, making up one thing which a human would think of as a character).

James