[Tutor] accented characters to unaccented
KB SU
k247d0 at gmail.com
Tue Jun 8 08:48:31 CEST 2010
Hi,
I open a URL and read its contents as follows:
>>> import urllib
>>> txt = urllib.urlopen("http://www.terme-catez.si").read()
>>> txt
This gives output like the following:
---- other parts are skipped ----
r\n 2010\r\n <a href="http://www.terme-catez.si" target="_blank">Terme
\xc4\x8cate\xc5\xbe</a>\r\n Slovenija\r\n <br />\r\n Spletne
re\xc5\xa1itve\r\n © 1996-\r\n 2010\r\n <a href="http://www.tmedia.biz"
target="_blank">(T)media</a></p>\r\n </div>\r\n</div>\r\n<div
class="ozadje_catez"></div>\r\n<div class="jsPopupDivFader" id="fader"
onClick="javascript:showHide(itemShown);">\r\n <table width="100%"
height="100%">\r\n <tr valign="middle">\r\n <td align="center"></td>\r\n
</tr>\r\n </table>\r\n</div>\r\n\r\n<script
src="http://www.google-analytics.com/urchin.js"
type="text/javascript"></script>\r\n<script type="text/javascript">\r\n_uacct =
"UA-1815955-1";\r\nurchinTracker();\r\n</script>\r\n\r\n</body>\r\n</html>\r\n'
As you can see above, in this chunk of HTML there is text like 'Terme
\xc4\x8cate\xc5\xbe' (the original is 'Terme Čatež'). I want to convert
byte sequences like '\xc4\x8c' or '\xc5\xbe' to unaccented characters, so
that 'Terme \xc4\x8cate\xc5\xbe' becomes 'Terme Catez'. Is there any way to
do this conversion for the whole HTML?
Thanks in advance.
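
To make the request concrete, here is a minimal sketch of the kind of
conversion I mean. It assumes the page is UTF-8 encoded (the bytes
'\xc4\x8c' are the UTF-8 encoding of 'Č'), and strip_accents is just a
hypothetical helper name, not an existing library function. It decodes the
bytes, decomposes each accented letter into a base letter plus combining
marks, and drops the marks.

```python
import unicodedata

def strip_accents(raw, encoding="utf-8"):
    """Decode raw bytes, then remove combining marks (accents).

    The encoding default is an assumption; use whatever the page declares.
    """
    text = raw.decode(encoding)
    # NFKD splits e.g. U+010C into 'C' followed by a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = b"Terme \xc4\x8cate\xc5\xbe"
print(strip_accents(sample))  # Terme Catez
```

The same call could presumably be applied to the whole txt string read
from the URL. Letters that have no decomposition (for example 'đ' or 'ø')
would pass through unchanged, so they would need a separate mapping.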