umlauts

Diez B. Roggisch deets at nospam.web.de
Sun Oct 18 03:28:40 CEST 2009


Arian Kuschki wrote:
> Whoa, that was quick! Thanks for all the answers. I'll try to recapitulate.
> 
>> What does this show you in your interactive interpreter?
>>
>> >>> print "\xc3\xb6"
>> ö
>>
>> For me, it's o-umlaut, ö. This is because the above bytes are the
>> sequence for ö in utf-8.
>>
>> If this shows something else, you need to adjust your terminal settings.
> 
> for me it also prints the correct o-umlaut (ö), so that was not the problem.
> 
> 
> All of the below result in xml that shows all umlauts correctly when printed:
> 
> xml.decode("cp1252")
> xml.decode("cp1252").encode("utf-8")
> xml.decode("iso-8859-1")
> xml.decode("iso-8859-1").encode("utf-8")
> 
> But when I then try to parse the xml, it only works if I
> do both decode and encode. If I only decode, I get the following error:
> SAXParseException: <unknown>:1:1: not well-formed (invalid token)
> 
> Do I understand right that since the encoding was not specified in the xml 
> response, it should have been utf-8 by default? And that if it had indeed been utf-8 I 
> would not have had the encoding problem in the first place?

Yes. XML without an explicit encoding declaration (and without a BOM) is 
implicitly UTF-8, and the page is borked because it uses cp* or latin* 
without saying so.
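
A minimal sketch of that point, assuming Python 3 (the session quoted above is 
Python 2) and a made-up one-element document standing in for the real response. 
The names raw, TextCollector and parse_text are only for illustration. The raw 
Latin-1 bytes are rejected by the parser, while the decode-then-encode round 
trip produces valid UTF-8 that parses fine:

```python
import xml.sax

# Hypothetical response body: Latin-1 bytes, no encoding declaration.
raw = "<r>sch\u00f6n</r>".encode("iso-8859-1")


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data seen during the parse."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(data):
    """Parse XML bytes and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


# Without a declaration the parser assumes UTF-8, and a lone 0xF6 byte
# is not valid UTF-8, so parsing the raw bytes fails.
try:
    parse_text(raw)
    raw_ok = True
except xml.sax.SAXParseException:
    raw_ok = False

# Decode with the encoding the server really used, then re-encode as
# UTF-8 so the bytes match what the parser assumes.
fixed = raw.decode("iso-8859-1").encode("utf-8")
text = parse_text(fixed)
```

The other way out would be for the server to declare what it actually sends, 
e.g. with an encoding="iso-8859-1" XML declaration, so no re-encoding is needed.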


Diez


