[Tutor] accented characters to unaccented

Peter Otten __peter__ at web.de
Tue Jun 8 09:59:31 CEST 2010


KB SU wrote:

> Hi,
> 
> I have opened a URL and read it as follows:
> 
> $import urllib
> $txt = urllib.urlopen("http://www.terme-catez.si").read()
> $txt

> As you can see above, in the chunk of HTML there is text like 'Terme
> \xc4\x8cate\xc5\xbe' (the original is 'Terme Čatež'). Now, I want to convert
> codes like '\xc4\x8c' or '\xc5\xbe' to unaccented characters so that 'Terme
> \xc4\x8cate\xc5\xbe' becomes 'Terme Catez'. Is there any way to convert the
> whole HTML?

First decode the UTF-8 byte string to unicode with

txt = txt.decode("utf-8")

and then follow

http://effbot.org/zone/unicode-convert.htm
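
As a rough illustration of the two steps (decode, then drop the accents), here is a minimal sketch that uses only the standard library's unicodedata module. It is not necessarily the method described at the link above. The helper name strip_accents and the sample byte string are made up for the example, and the code assumes the page really is UTF-8 encoded.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented character into its base letter followed by
    # one or more combining marks, e.g. U+010C -> "C" + U+030C (caron).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample bytes as they appear in the question; assumed to be UTF-8.
raw = b"Terme \xc4\x8cate\xc5\xbe"
txt = raw.decode("utf-8")   # step 1: bytes -> unicode text
plain = strip_accents(txt)  # step 2: remove the accents
print(plain)
```

Letters that have no decomposition, such as 'ø' or 'ß', pass through unchanged with this approach, so an explicit replacement table is still needed for those.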


Peter


