Unicode conversion

Edward K. Ream edream at tds.net
Thu Oct 3 15:10:44 CEST 2002


My app presently will write Unicode in any format the user desires as long
as it is UTF-8 ;-)

Here is the code that I use to translate from the UTF-8 delivered by the Tk
Text widget to the desired encoding:

print `xml_encoding`
# Tk always uses utf-8 encoding.
print `s`,"tk"
s = s.encode("utf-8") # result is a string.
print `s`,"utf-8"
s = s.decode(xml_encoding) # result is unicode.
s = s.encode(xml_encoding) # result is a string.
print `s`,`xml_encoding`

If I start with:

aĂßÉd

a
U+0102(Latin Capital Letter A with Breve)
U+00df(Latin Small Letter Sharp S)
U+00c9(Latin Capital Letter E with Acute)
d

and delete the trailing d, the output is:

u'a\u0102\xdf\xc9\n' tk
'a\xc4\x82\xc3\x9f\xc3\x89\n' utf-8
'a\xc4\x82\xc3\x9f\xc3\x89\n' 'ISO-8859-1'

As you can see, the results of the two encodes are identical. My app writes
the result of the second encode to the file.  Viewing a file containing these
characters (say with MS Word) works properly only if UTF-8 is used.  Weird
characters appear when the desired ISO-8859-1 encoding is used.
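
For anyone who wants to reproduce the output above, here is a minimal sketch
in Python 3 syntax (an assumption; the code in this post is Python 2, where
the result of encode is a plain string rather than a bytes object). The names
text, utf8_bytes, as_latin1_text and final_bytes are illustrative only. It
suggests why the two encodes match: decoding bytes as ISO-8859-1 and encoding
them again with the same codec gives back the bytes you started with, so the
UTF-8 bytes pass through unchanged.

```python
# Python 3 sketch of the encode / decode / encode sequence from the post.
# Variable names other than xml_encoding are hypothetical.
text = "a\u0102\xdf\xc9\n"      # the unicode value the Tk Text widget returned
xml_encoding = "ISO-8859-1"

# Step 1: unicode -> UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: reinterpret those bytes as ISO-8859-1. Python's ISO-8859-1 codec
# maps every byte value to a code point, so this never fails; each UTF-8
# byte simply becomes its own character.
as_latin1_text = utf8_bytes.decode(xml_encoding)

# Step 3: encode with the same codec, which maps each character back to
# the byte it came from.
final_bytes = as_latin1_text.encode(xml_encoding)

print(repr(utf8_bytes))
print(repr(final_bytes))
print(final_bytes == utf8_bytes)
```

If this reading is right, the file always contains UTF-8 bytes regardless of
xml_encoding, which would be consistent with it displaying correctly only
when it is read as UTF-8.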

BTW, without the first encode/decode pair I can get exceptions in the last
encode.
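
One plausible source of those exceptions, sketched in Python 3 syntax (an
assumption; the code in this post is Python 2): U+0102 lies above U+00FF, so
it has no representation in ISO-8859-1, and encoding it directly raises an
error. The names text, failed and latin1_only are illustrative only.

```python
# Python 3 sketch: encode the widget text directly, with no UTF-8 round trip.
text = "a\u0102\xdf\xc9\n"

try:
    text.encode("ISO-8859-1")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # exc.object, exc.start and exc.end identify the offending character(s).
    print("cannot encode:", repr(exc.object[exc.start:exc.end]))

# Characters that ISO-8859-1 does contain encode to one byte each.
latin1_only = "a\xdf\xc9\n".encode("ISO-8859-1")
print(repr(latin1_only))
```

Here the sharp s and the E with acute become the single bytes 0xdf and 0xc9,
while the A with breve is the character that cannot be encoded.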

Can anyone explain what is happening and what I should be doing? I'm totally
confused.  Thanks.

Edward
--------------------------------------------------------------------
Edward K. Ream   email:  edream at tds.net
Leo: Literate Editor with Outlines
Leo: http://personalpages.tds.net/~edream/front.html
--------------------------------------------------------------------





