Unicode characters in byte-strings
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Fri Mar 12 07:35:57 EST 2010
I know this is wrong, but I'm not sure just how wrong it is, or why.
Using Python 2.x:
>>> s = "éâÄ"
>>> print s
éâÄ
>>> len(s)
6
>>> list(s)
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']
Can somebody explain what happens when I put non-ASCII characters into a
non-unicode string? My guess is that the result will depend on the
current encoding of my terminal.
In this case, my terminal is set to UTF-8. If I change it to ISO 8859-1,
and repeat the above, I get this:
>>> list("éâÄ")
['\xe9', '\xe2', '\xc4']
If I start from a unicode string and encode it explicitly:
>>> s = u"éâÄ"
>>> s.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\x84'
>>> s.encode('iso8859-1')
'\xe9\xe2\xc4'
I get the same bytes as in the two sessions above, which at least explains
why the bytes have the values they do.
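As a rough sketch of the relationship I think I'm seeing, here is the same
comparison written as a script rather than typed at the terminal. The
characters are written as escapes so that neither the terminal encoding nor
the source file encoding affects the result. The variable names (text,
utf8_bytes, latin1_bytes, mojibake) are mine, purely for illustration, and
the last step only shows what happens if the bytes are decoded with the
"wrong" codec.

```python
from __future__ import print_function, unicode_literals

# The three characters from the session above, written as escapes so the
# result does not depend on the terminal or source file encoding.
text = "\u00e9\u00e2\u00c4"

# Encode the same text two ways.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso8859-1")

print(len(text))          # number of characters
print(len(utf8_bytes))    # number of bytes in the UTF-8 form
print(len(latin1_bytes))  # number of bytes in the ISO 8859-1 form

# Decoding with the codec that produced the bytes recovers the text.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as ISO 8859-1 does not raise an error, but each
# byte becomes its own character, so the result is a different string.
mojibake = utf8_bytes.decode("iso8859-1")
print(len(mojibake))
```

If this is right, then the byte-string typed at a UTF-8 terminal has a
length of 6 simply because len() is counting bytes, not characters.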
Thank you,
--
Steven