[Tutor] accented characters to unaccented
Peter Otten
__peter__ at web.de
Tue Jun 8 09:59:31 CEST 2010
KB SU wrote:
> Hi,
>
> I opened a URL and read it like this:
>
> $import urllib
> $txt = urllib.urlopen("http://www.terme-catez.si").read()
> $txt
> In the output above, among the HTML, there is text like 'Terme
> \xc4\x8cate\xc5\xbe' (the original is 'Terme Čatež'). I want to convert
> sequences like '\xc4\x8c' or '\xc5\xbe' to unaccented characters, so that
> 'Terme \xc4\x8cate\xc5\xbe' becomes 'Terme Catez'. Is there a way to do
> this for the whole HTML document?
First decode the byte string to unicode with

    txt = txt.decode("utf-8")

and then follow the instructions at
http://effbot.org/zone/unicode-convert.htm
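
A minimal sketch of those two steps, assuming Python 3 (where the downloaded data is a bytes object) and using the standard unicodedata module. This is one common way to drop accents and not necessarily the method described at the link above; strip_accents is a hypothetical helper name, and the sample bytes are copied from the question rather than fetched from the site.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: drop combining marks from a unicode string."""
    # NFKD splits an accented letter into its base letter plus
    # separate combining marks, e.g. 'Č' becomes 'C' + U+030C (caron).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# UTF-8 encoded bytes, as they appear in the question.
raw = b"Terme \xc4\x8cate\xc5\xbe"

# Step 1: decode the bytes to unicode.
text = raw.decode("utf-8")

# Step 2: remove the accents.
plain = strip_accents(text)
print(plain)  # Terme Catez
```

Letters that have no decomposition (for example 'ø' or 'ß') pass through unchanged with this approach, so they would need an explicit replacement table.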
Peter