Unicode characters in byte-strings

Martin v. Loewis martin at v.loewis.de
Fri Mar 12 15:56:42 EST 2010


>> Can somebody explain what happens when I put non-ASCII characters into a
>> non-unicode string? My guess is that the result will depend on the
>> current encoding of my terminal.
> 
> Exactly right.

To elaborate on the "what happens" part: the string that gets entered is
typically passed as a byte sequence, from the terminal (application) to
the OS kernel, from the OS kernel to Python's stdin, and from there to
the parser. Python recognizes the string delimiters, but (practically)
leaves the bytes between the delimiters as-is (*), creating a byte
string object with the very same bytes.
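To make the dependence on the terminal concrete, here is a small sketch of what two differently configured terminals would send for the same three keystrokes. The two encodings are picked only for illustration, and the characters are written as escapes so the snippet does not depend on its own source encoding.

```python
# The three characters, written as escapes: e-acute, a-circumflex, A-diaeresis.
text = u"\xe9\xe2\xc4"

# Bytes a terminal configured for UTF-8 would send (two bytes per character).
utf8_bytes = text.encode("utf-8")

# Bytes a terminal configured for Latin-1 would send (one byte per character).
latin1_bytes = text.encode("latin-1")

# A byte string literal keeps whichever of these sequences arrived, so the
# same keystrokes give byte strings of different content and length.
print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(utf8_bytes == latin1_bytes)
```

The last line prints False: the byte string carries no record of which encoding produced it.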

The more interesting question is what happens when you do

py> s = u"éâÄ"

Here, Python needs to decode the bytes, according to some encoding.
Usually, it would want to use the source encoding (as given through
a "-*- coding: <name> -*-" Emacs-style marker), but at the interactive
prompt there is none. Various Python versions then try different
things; what they should do is determine the terminal encoding, and
decode the bytes according to that one.
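A rough sketch of that "should do" behaviour, done by hand: look up the terminal encoding and decode the raw bytes with it. The helper name decode_terminal_bytes and its fallback order are my own assumptions for illustration, not what any particular Python version does internally.

```python
import locale
import sys


def decode_terminal_bytes(raw, encoding=None):
    """Decode bytes typed at a terminal (hypothetical helper)."""
    if encoding is None:
        # sys.stdin.encoding can be missing or None (e.g. redirected input),
        # so fall back to the locale's preferred encoding, then to ASCII.
        encoding = (getattr(sys.stdin, "encoding", None)
                    or locale.getpreferredencoding(False)
                    or "ascii")
    return raw.decode(encoding)


expected = u"\xe9\xe2\xc4"

# Decoding with the encoding the terminal actually used recovers the text.
print(decode_terminal_bytes(b"\xc3\xa9\xc3\xa2\xc3\x84", "utf-8") == expected)
print(decode_terminal_bytes(b"\xe9\xe2\xc4", "latin-1") == expected)

# Guessing wrong does not necessarily fail; it may silently give other text.
wrong = decode_terminal_bytes(b"\xc3\xa9\xc3\xa2\xc3\x84", "latin-1")
print(len(wrong))
```

The final line prints 6 rather than 3: the UTF-8 bytes were misread as six separate Latin-1 characters.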

Regards,
Martin

(*) If a source encoding was given, the source is actually recoded to
UTF-8, parsed, and then re-encoded back into the original encoding.
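The round trip in the footnote can be imitated as follows. This is only a simplified model: recode_roundtrip is a hypothetical name, the parsing step is skipped entirely, and Latin-1 is assumed as the declared source encoding.

```python
def recode_roundtrip(source_bytes, source_encoding):
    """Imitate recode-to-UTF-8 and re-encode (hypothetical helper)."""
    # Step 1: recode the source from its declared encoding to UTF-8.
    utf8_form = source_bytes.decode(source_encoding).encode("utf-8")
    # Step 2 would be parsing the UTF-8 text; omitted in this sketch.
    # Step 3: re-encode back into the original encoding.
    restored = utf8_form.decode("utf-8").encode(source_encoding)
    return utf8_form, restored


# A line of source as it would sit in a Latin-1 encoded file.
original = b"s = '\xe9\xe2\xc4'"
utf8_form, restored = recode_roundtrip(original, "latin-1")

# The intermediate form differs, but the bytes that end up in the byte
# string are the ones that were in the file.
print(utf8_form == original)
print(restored == original)
```

This prints False and then True, which is why the bytes look untouched from the outside.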


