umlauts

Diez B. Roggisch deets at nospam.web.de
Sun Oct 18 03:41:03 CEST 2009


Diez B. Roggisch wrote:
> Arian Kuschki wrote:
>> Whoa, that was quick! Thanks for all the answers, I'll try to 
>> recapitulate
>>
>>> What does this show you in your interactive interpreter?
>>>
>>>>>> print "\xc3\xb6"
>>> ö
>>>
>>> For me, it's o-umlaut, ö. This is because the above bytes are the
>>> sequence for ö in utf-8.
>>>
>>> If this shows something else, you need to adjust your terminal settings.
>>
>> for me it also prints the correct o-umlaut (ö), so that was not the 
>> problem.
>>
>>
>> All of the below result in xml that shows all umlauts correctly when 
>> printed:
>>
>> xml.decode("cp1252")
>> xml.decode("cp1252").encode("utf-8")
>> xml.decode("iso-8859-1")
>> xml.decode("iso-8859-1").encode("utf-8")
>>
>> But when I want to parse the xml then, it only works if I
>> do both decode and encode. If I only decode, I get the following error:
>> SAXParseException: <unknown>:1:1: not well-formed (invalid token)
>>
>> Do I understand right that since the encoding was not specified in the 
>> xml response, it should have been utf-8 by default? And that if it had 
>> indeed been utf-8 I would not have had the encoding problem in the 
>> first place?
> 
> Yes. XML without explicit encoding is implicitly UTF-8, and the page is 
> borked using cp* or latin* without saying so.

OK, after reading some other posts in this thread, this assumption does 
not seem to hold. The HTTP protocol allows other encodings to be given 
implicitly, outside the document itself. Which I think is an atrocity.
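
For what it's worth, here is a rough sketch of how one might cope with that: 
take the charset from the HTTP Content-Type header if there is one, decode 
with it, and hand the parser UTF-8 bytes (the decode-then-encode step that 
worked for Arian). It is written in Python 3 syntax. The function names and 
the UTF-8 fallback are my own assumptions for illustration, not something 
from this thread or from any particular library.

```python
from email.message import Message
import xml.dom.minidom


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value.

    Falls back to `default` when no charset is given. The UTF-8
    fallback is an assumption of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def parse_xml_bytes(raw, content_type):
    """Decode `raw` using the header charset, then parse it as UTF-8.

    Assumes the document carries no encoding declaration of its own,
    as was the case for the response discussed above.
    """
    charset = charset_from_content_type(content_type)
    text = raw.decode(charset)
    # Without an XML declaration the parser assumes UTF-8,
    # so re-encode before handing the bytes over.
    return xml.dom.minidom.parseString(text.encode("utf-8"))


# Hypothetical response body and header, for illustration only.
raw = "<a>\u00f6</a>".encode("iso-8859-1")
doc = parse_xml_bytes(raw, "text/xml; charset=ISO-8859-1")
print(doc.documentElement.firstChild.data)
```

Feeding the raw ISO-8859-1 bytes straight to the parser fails instead, 
because the lone 0xF6 byte is not valid UTF-8.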

Diez


