umlauts

StarWing weasley_wx at sina.com
Sat Oct 17 12:55:24 EDT 2009


On Oct 18, 12:14 am, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Arian Kuschki wrote:
> > Hi all
>
> > this has been bugging me for a long time and I do not seem to be able to
> > understand what to do. I always have problems when dealing with input text that
> > contains umlauts. Consider the following:
>
> > In [1]: import urllib
>
> > In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>
> > In [3]: xml = f.read()
>
> > In [4]: f.close()
>
> > In [5]: print xml
> > ------> print(xml)
> > <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
> > tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
> >> <forecast_information><cit
> > y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
> > data=""/><longitude_e6 data=""/><forecast_date
> > data="2009-10-17"/><current_date_time data="2009-10
> > -17 14:20:00 +0000"/><unit_system
> > data="SI"/></forecast_information><current_conditions><condition data="Meistens
> > bew kt"/><temp_f data="43"/><temp_c data="6"/><h
> > umidity data="Feuchtigkeit: 87 %"/><icon
> > data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
> > Windgeschwindigkeiten von 13 km/h"/></curr
> > ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
> > data="1"/><high data="7"/><icon
> > data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
> > ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
> > data="So."/><low data="-1"/><high data="8"/><icon
> > data="/ig/images/weather/chance_of_sno
> > w.gif"/><condition data="Vereinzelt
> > Schnee"/></forecast_conditions><forecast_conditions><day_of_week
> > data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
> > mages/weather/mostly_sunny.gif"/><condition data="Teils
> > sonnig"/></forecast_conditions><forecast_conditions><day_of_week
> > data="Di."/><low data="0"/><high data="8"
> > /><icon data="/ig/images/weather/sunny.gif"/><condition
> > data="Klar"/></forecast_conditions></weather></xml_api_reply>
>
> > As you can see the umlauts in the XML are not displayed properly. When I want
> > to process this text (for example with xml.sax), I get error messages because
> > the parser can't read this.
>
> > I've tried to read up on this and there is a lot of information on the web, but
> > nothing seems to work for me. For example setting the coding to UTF like this:
> > # -*- coding: utf-8 -*- or using the decode() string method.
>
> > I always have this kind of problem when input contains umlauts, not just in
> > this case. My locale (on Ubuntu) is en_GB.UTF-8.
>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.
>
> You should decode the bytestring to Unicode and then re-encode it to
> UTF-8. I don't know what encoding the website is actually using; here
> I'm assuming ISO-8859-1:
>
> print xml.decode("iso-8859-1").encode("utf-8")

In 2.6, str.decode returns a unicode object, so you can print it directly.
In 3.1, bytes.decode returns a str, so you can also print it directly.

So just calling decode("cp1252") is enough.
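
A minimal sketch of that, written for Python 3 so it runs without the network. The sample bytes and the helper name to_text are made up for illustration, and cp1252 is only an assumption about what the server sends, same as above:

```python
# Hypothetical sample standing in for the bytes returned by f.read().
# 0xF6 is "ö" in both ISO-8859-1 and cp1252.
raw = b'<condition data="Meistens bew\xf6lkt"/>'


def to_text(data, encoding="cp1252"):
    """Decode a byte string into text using the assumed source encoding."""
    return data.decode(encoding)


text = to_text(raw)
print(text)

# The same bytes are not valid UTF-8, which is why a UTF-8 console
# cannot display them when the raw byte string is printed.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

If the server declares a different charset in its response headers, pass that to to_text instead of relying on the cp1252 default.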


