Unicode characters in byte-strings

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Mar 12 13:35:57 CET 2010

I know this is wrong, but I'm not sure just how wrong it is, or why. 
Using Python 2.x:

>>> s = "éâÄ"
>>> print s
éâÄ
>>> len(s)
6
>>> list(s)
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

Can somebody explain what happens when I put non-ASCII characters into a 
non-unicode string? My guess is that the result will depend on the 
current encoding of my terminal.
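
One way to check that guess is to ask Python which encodings it reports for the environment it is running in. The following is only a minimal sketch: describe_encodings is a hypothetical helper name, not part of the standard library, and the values it reports depend entirely on the local setup.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the current environment."""
    return {
        # Encoding attached to the standard streams; this may be None when
        # input or output is redirected instead of attached to a terminal.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding implied by the user's locale settings.
        "locale": locale.getpreferredencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print("%s: %s" % (name, value))
```

If the terminal is really sending UTF-8 but Python reports something else here, the bytes that end up in a byte-string literal would still be whatever the terminal sent.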

In this case, my terminal is set to UTF-8. If I change it to ISO 8859-1, 
and repeat the above, I get this:

>>> list("éâÄ")
['\xe9', '\xe2', '\xc4']

If I do this:

>>> s = u"éâÄ"
>>> s.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\x84'
>>> s.encode('iso8859-1')
'\xe9\xe2\xc4'

I get the same byte values as in the two sessions above, which at least 
explains why the bytes have the values they do.
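
The comparison between the two encodings can be sketched without relying on the terminal at all, by writing the characters as escapes so the source file's own encoding does not matter. This is an illustrative sketch, not a definitive explanation: byte_values is a hypothetical helper, and the code is written to behave the same way whether the interpreter treats plain string literals as bytes or as text.

```python
# The three characters e-acute, a-circumflex and A-diaeresis, written as
# escapes so the result does not depend on how this file is encoded.
text = u"\xe9\xe2\xc4"


def byte_values(text, encoding):
    """Encode text and return each resulting byte as a hex string."""
    encoded = text.encode(encoding)
    # bytearray yields integers when iterated, regardless of Python version.
    return ["0x%02x" % b for b in bytearray(encoded)]


for encoding in ("utf-8", "iso8859-1"):
    values = byte_values(text, encoding)
    print("%-10s %d bytes: %s" % (encoding, len(values), " ".join(values)))
```

UTF-8 needs two bytes for each of these three characters, while ISO 8859-1 needs one, which matches the lengths of the two lists shown earlier.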

Thank you,

