[Python-Dev] len(chr(i)) = 2?

Glyph Lefkowitz glyph at twistedmatrix.com
Wed Nov 24 02:52:13 CET 2010


On Nov 23, 2010, at 7:22 PM, James Y Knight wrote:

> On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
>> Maybe Python should have used UTF-8 as its internal unicode
>> representation. Then people who were foolish enough to assume
>> one character per string item would have their programs break
>> rather soon under only light unicode testing. :-)
> 
> You put a smiley, but, in all seriousness, I think that's actually the right thing to do if anyone writes a new programming language. It is clearly the right thing if you don't have to be concerned with backwards-compatibility: nobody really needs to be able to access the Nth codepoint in a string in constant time, so there's not really any point in storing a vector of codepoints.
> 
> Instead, provide bidirectional iterators which can traverse the string by byte, codepoint, or by grapheme (that is: the set of combining characters + base character that go together, making up one thing which a human would think of as a character).


I really hope that this idea is not just for new programming languages.  If you switch from doing unicode "wrong" to doing unicode "right" in Python, you quadruple the memory footprint of programs which primarily store and manipulate large amounts of text.

This is especially ridiculous in PyGTK applications, where the internal representation required by the GUI is UTF-8 anyway, so round-tripping string data to and from the exploded UTF-32 representation wastes gobs of memory and time.  It at least makes sense when your C library's idea about character width and your Python build match up.
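A rough illustration of the footprint difference, as a minimal sketch assuming Python 3. It only compares encoded byte counts, not the interpreter's actual in-memory object sizes, and the helper name encoded_sizes and the sample text are made up for the example:

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for text, with no BOM in either."""
    # "utf-32-le" is used instead of "utf-32" so that no 4-byte BOM is
    # prepended and the count reflects only the characters themselves.
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


# Hypothetical sample: the kind of mostly-ASCII text a server shuffles around.
ascii_text = "GET /index.html HTTP/1.1"
utf8_size, utf32_size = encoded_sizes(ascii_text)

# For pure-ASCII text every character is 1 byte in UTF-8 and 4 in UTF-32.
print(utf8_size, utf32_size, utf32_size / utf8_size)
```

For text outside ASCII the ratio shrinks, since UTF-8 needs 2 to 4 bytes per codepoint there, but UTF-8 is never larger than UTF-32.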

But in a desktop app this is unlikely to be a performance concern; in servers it's a big deal, and measurably so.  I am pretty sure that in the server apps that I work on, we are eventually going to need our own string type and UTF-8 logic that does exactly what James suggested - certainly if we ever hope to support Py3.
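To make the shape of that suggestion concrete, here is a minimal sketch, assuming Python 3 and valid UTF-8 input, of traversing UTF-8 data by byte, by codepoint, and by grapheme. All three function names are hypothetical, the iterators are forward-only rather than bidirectional, and the grapheme step is only an approximation (a base character plus any following combining marks, per unicodedata.combining), not full Unicode grapheme-cluster segmentation:

```python
import unicodedata


def iter_bytes(utf8):
    """Yield each byte of UTF-8 encoded data as an int."""
    return iter(utf8)


def iter_codepoints(utf8):
    """Yield one str per codepoint, using the lead byte to find its width.

    Assumes the input is valid UTF-8; no error recovery is attempted.
    """
    i = 0
    while i < len(utf8):
        lead = utf8[i]
        if lead < 0x80:
            width = 1      # 0xxxxxxx: ASCII
        elif lead < 0xE0:
            width = 2      # 110xxxxx
        elif lead < 0xF0:
            width = 3      # 1110xxxx
        else:
            width = 4      # 11110xxx
        yield utf8[i:i + width].decode("utf-8")
        i += width


def iter_graphemes(utf8):
    """Yield approximate graphemes: a base character plus the combining
    marks that follow it. This is a simplification, not full segmentation.
    """
    cluster = ""
    for ch in iter_codepoints(utf8):
        if cluster and unicodedata.combining(ch) == 0:
            # A non-combining character starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
data = "e\u0301a".encode("utf-8")
print(len(list(iter_bytes(data))),
      len(list(iter_codepoints(data))),
      len(list(iter_graphemes(data))))
```

None of the three views needs constant-time access to the Nth codepoint; each one walks the same UTF-8 bytes and only differs in where it draws the boundaries.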

(I dimly recall that both James and I have made this point before, but it's pretty important, so it bears repeating.)


