umlauts

Diez B. Roggisch deets at nospam.web.de
Sat Oct 17 12:54:10 EDT 2009


MRAB wrote:
> Arian Kuschki wrote:
>> Hi all
>>
>> this has been bugging me for a long time and I do not seem to be able 
>> to understand what to do. I always have problems when dealing with input 
>> text that contains umlauts. Consider the following:
>>
>> In [1]: import urllib
>>
>> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>
>> In [3]: xml = f.read()
>>
>> In [4]: f.close()
>>
>> In [5]: print xml
>> ------> print(xml)
>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0" 
>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
>>> <forecast_information><cit
>> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6 
>> data=""/><longitude_e6 data=""/><forecast_date 
>> data="2009-10-17"/><current_date_time data="2009-10
>> -17 14:20:00 +0000"/><unit_system 
>> data="SI"/></forecast_information><current_conditions><condition 
>> data="Meistens bew�kt"/><temp_f data="43"/><temp_c data="6"/><h
>> umidity data="Feuchtigkeit: 87�%"/><icon 
>> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition 
>> data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/></curr
>> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low 
>> data="1"/><high data="7"/><icon 
>> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
>> ereinzelt 
>> Regen"/></forecast_conditions><forecast_conditions><day_of_week 
>> data="So."/><low data="-1"/><high data="8"/><icon 
>> data="/ig/images/weather/chance_of_sno
>> w.gif"/><condition data="Vereinzelt 
>> Schnee"/></forecast_conditions><forecast_conditions><day_of_week 
>> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
>> mages/weather/mostly_sunny.gif"/><condition data="Teils 
>> sonnig"/></forecast_conditions><forecast_conditions><day_of_week 
>> data="Di."/><low data="0"/><high data="8"
>> /><icon data="/ig/images/weather/sunny.gif"/><condition 
>> data="Klar"/></forecast_conditions></weather></xml_api_reply>
>>
>> As you can see the umlauts in the XML are not displayed properly. When 
>> I want to process this text (for example with xml.sax), I get error 
>> messages because the parser can't read this.
>>
>> I've tried to read up on this and there is a lot of information on the 
>> web, but nothing seems to work for me. For example setting the coding 
>> to UTF like this: # -*- coding: utf-8 -*- or using the decode() string 
>> method.
>>
>> I always have this kind of problem when input contains umlauts, not 
>> just in this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.

This is weird. I looked at the site in Firefox - and it was displayed 
correctly, including umlauts. The page info dialog claims the 
page is UTF-8, and the XML itself says so as well (implicitly, through the 
missing encoding declaration) - but it clearly is *not* UTF-8.
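
A rough sketch of how one might work around this, assuming the bytes are 
really ISO-8859-1 (a guess; nothing in the response says so). The sample 
bytes, the decode_with_fallback helper and the DataCollector handler are 
made up for illustration and are not the real response. The idea is to try 
UTF-8 first, fall back to ISO-8859-1, and hand the parser text re-encoded 
as UTF-8, since that is what it assumes when no encoding is declared:

```python
import xml.sax


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


class DataCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects every 'data' attribute it sees."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.values = []

    def startElement(self, name, attrs):
        value = attrs.get("data")
        if value is not None:
            self.values.append(value)


# Made-up sample: the o-umlaut is the single byte 0xF6 (ISO-8859-1),
# which is not a valid UTF-8 sequence.
sample = b'<condition data="Meistens bew\xf6lkt"/>'

text, used = decode_with_fallback(sample)

# Re-encode as UTF-8 so a parser that assumes UTF-8 accepts the input.
utf8_bytes = text.encode("utf-8")

handler = DataCollector()
xml.sax.parseString(utf8_bytes, handler)
print(used)
```

Note that ISO-8859-1 accepts any byte sequence, so the fallback never 
fails; it only gives sensible text if the guess about the encoding is right.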

One would expect Google to be better at this...

Diez


