[Tutor] character encoding

wesley chun wescpy at gmail.com
Wed Jul 9 08:12:24 CEST 2008

>  > Hi, I'm puzzled by the character encodings which I get when I use Python
>  > with IDLE. The string '\xf6' represents a letter in the Swedish alphabet
>  > when coded with utf8. On our computer with MacOSX this gets coded as
>  > '\xc3\xb6' which is a string of length 2. I have configured IDLE to encode
>  > utf8 but it doesn't make any difference.
> I think you may be a bit confused about utf-8. '\xf6' is not a utf-8
>  character. U+00F6 is the Unicode (not utf-8) codepoint for LATIN SMALL
>  LETTER O WITH DIAERESIS. '\xf6' is also the Latin-1 encoding of this
>  character. The utf-8 encoding of this character is the two-byte
>  sequence '\xc3\xb6'.
> Also you might want to do some background reading on Unicode;
>  this is a good place to start:
>  http://www.joelonsoftware.com/articles/Unicode.html

Kent is quite correct, and here is some Python code to demo it:

>>> x = u'\xf6'
>>> x
u'\xf6'
>>> print x
ö
>>> y = x.encode('utf-8')
>>> y
'\xc3\xb6'
>>> print y
ö

in the code above, our source string 'x' is a Unicode string, which is
"pure," meaning that it has not been encoded by any codec. we encode
this Unicode string into a UTF-8 byte string 'y', which takes up 2
bytes as Kent has mentioned already. dumping the variables shows their
escaped repr on any terminal, while the print statements display the
character correctly only because our terminal was set to UTF-8.
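
for anyone following along on Python 3, here is a minimal sketch of the
same idea. it assumes Python 3, where a plain 'str' is already Unicode
and encode() returns a 'bytes' object; ascii() is used so that the output
does not depend on how your terminal is configured:

```python
# Python 3 sketch (an assumption; the session above is Python 2)
x = '\xf6'                # one Unicode code point, U+00F6
y = x.encode('utf-8')     # the UTF-8 encoding of that code point, as bytes

print(ascii(x), len(x))   # text: a single character
print(ascii(y), len(y))   # bytes: two bytes, 0xc3 0xb6

# decoding with the same codec gets the original text back
assert y.decode('utf-8') == x
```

the lengths differ because len() counts code points for text and bytes
for encoded data.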

if we switch our terminal output to Latin-1, then we can view it that
way -- notice that the Latin-1 encoding only takes 1 byte instead of 2
for UTF-8:

>>> z = x.encode('latin-1')
>>> z
'\xf6'
>>> print z
ö
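
the two encodings are not interchangeable, which is probably where the
original confusion came from. a rough Python 3 sketch (again an
assumption, since the session above is Python 2; the names 'mojibake'
and 'failed' are just made up for illustration) of what happens when
bytes are decoded with the wrong codec:

```python
# Python 3 sketch: same character, two encodings, mismatched decoding
x = '\xf6'                       # U+00F6
z = x.encode('latin-1')          # 1 byte:  0xf6
y = x.encode('utf-8')            # 2 bytes: 0xc3 0xb6

# matching codec: round-trips cleanly
assert z.decode('latin-1') == x

# UTF-8 bytes read as Latin-1: each byte becomes its own character
mojibake = y.decode('latin-1')
print(ascii(mojibake), len(mojibake))

# a lone 0xf6 byte is not valid UTF-8, so decoding it raises an error
failed = False
try:
    z.decode('utf-8')
except UnicodeDecodeError:
    failed = True
print(failed)
```

so a terminal or editor that guesses the wrong encoding will either show
two odd characters in place of one or refuse the data outright.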

here's another recommended Unicode document that is slightly more

-- wesley
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
"Core Python Programming", Prentice Hall, (c)2007,2001

wesley.j.chun :: wescpy-at-gmail.com
python training and technical consulting
cyberweb.consulting : silicon valley, ca
