PEP 263 status check

"Martin v. Löwis" martin at
Fri Aug 6 14:20:41 CEST 2004

John Roth wrote:
> Or are you trying to say that the character string will
> contain the UTF-8 encoding of these characters; that
> is, if I do a subscript, I will get one character of the
> multi-byte encoding?

Michael is almost right: this is what happens. Except that
I wouldn't call what you get a "character". Instead, it
is always a single byte - even if that byte is only part of
a multi-byte character.

Unfortunately, the things that constitute a byte string
are also called characters in the literature.

To be more specific: In a UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print "True" twice, and len("ö") is 2.
OTOH, len(u"ö") == 1.
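
The same comparison can be sketched without depending on how the
source file itself is encoded, by writing the character as an escape
and encoding it explicitly. This is only an illustration: the names
"text" and "data" are made up here, and it assumes an interpreter that
accepts both the u"" and b"" prefixes.

```python
# One Unicode character: LATIN SMALL LETTER O WITH DIAERESIS (U+00F6).
text = u"\xf6"

# The byte string a UTF-8 source file would contain for that character.
data = text.encode("utf-8")

byte_count = len(data)   # length of the byte string
char_count = len(text)   # length of the Unicode string

# Slice rather than index, so the result is a one-byte string
# regardless of how the interpreter treats indexing of byte strings.
first_byte = data[0:1]

# Decoding the bytes as UTF-8 gives back the original character.
round_trip = data.decode("utf-8") == text

print(byte_count, char_count, first_byte == b"\xc3", round_trip)
```

The slice stands in for the subscript in the example above: it yields
the first byte of the encoding, not the character.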

> The point of this is that I don't think that either behavior
> is what one would expect. It's also an open invitation
> for someone to make an unchecked mistake! I think this
> may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

