Unicode characters in byte-strings

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Mar 12 07:35:57 EST 2010


I know this is wrong, but I'm not sure just how wrong it is, or why. 
Using Python 2.x:

>>> s = "éâÄ"
>>> print s
éâÄ
>>> len(s)
6
>>> list(s)
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

Can somebody explain what happens when I put non-ASCII characters into a 
non-unicode string? My guess is that the result will depend on the 
current encoding of my terminal.

In this case, my terminal is set to UTF-8. If I change it to ISO 8859-1, 
and repeat the above, I get this:

>>> list("éâÄ")
['\xe9', '\xe2', '\xc4']
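
Both byte sequences above appear to be the same three characters, just 
encoded differently. A minimal sketch of that idea follows, assuming 
Python 2.6 or later (it also runs on 3.x). The variable names are 
illustrative only. It decodes each sequence with the codec the terminal 
was set to at the time:

```python
# Byte sequences copied from the two terminal sessions above.
utf8_bytes = b"\xc3\xa9\xc3\xa2\xc3\x84"    # terminal set to UTF-8
latin1_bytes = b"\xe9\xe2\xc4"              # terminal set to ISO 8859-1

# Decoding with the matching codec turns the bytes back into text.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("iso8859-1")

# Both decode to the same three-character unicode string.
same_text = (from_utf8 == from_latin1)
print("bytes: %d and %d, characters: %d, same text: %s"
      % (len(utf8_bytes), len(latin1_bytes), len(from_utf8), same_text))
```

If that reading is right, the byte-string itself carries no record of 
which encoding produced it; only the decode step supplies that.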

If I start from a unicode string and encode it explicitly, I get the 
same bytes:

>>> s = u"éâÄ"
>>> s.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\x84'
>>> s.encode('iso8859-1')
'\xe9\xe2\xc4'

That at least explains why the bytes have the values they do.
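
To get a feel for how wrong things can go, here is a rough sketch of 
what happens when the bytes are read back with the other encoding. It 
assumes Python 2.6 or later (it also runs on 3.x), writes the characters 
as escapes so the result does not depend on the terminal, and uses 
names that are purely illustrative:

```python
# The same three characters, written as escapes (U+00E9, U+00E2, U+00C4).
text = u"\xe9\xe2\xc4"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso8859-1")

# Reading UTF-8 bytes as ISO 8859-1 raises no error, because every byte
# value is valid in that codec, but each byte becomes its own character.
misread = utf8_bytes.decode("iso8859-1")
misread_length = len(misread)

# Reading these ISO 8859-1 bytes as UTF-8 fails, because they do not
# form valid UTF-8 sequences.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print("misread length: %d, strict decode failed: %s"
      % (misread_length, strict_decode_failed))
```

So one direction quietly produces the wrong text and the other raises 
an exception, which seems to be why keeping track of the encoding 
matters.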


Thank you,



-- 
Steven


