[Tutor] accented characters to unaccented

KB SU k247d0 at gmail.com
Tue Jun 8 08:48:31 CEST 2010


Hi,

I open a URL and read its contents like this:

>>> import urllib
>>> txt = urllib.urlopen("http://www.terme-catez.si").read()
>>> txt

This gives output like the following:
----other parts are skipped ---
r\n      2010\r\n      <a href="http://www.terme-catez.si" target="_blank">Terme \xc4\x8cate\xc5\xbe</a>\r\n
      Slovenija\r\n      <br />\r\n      Spletne re\xc5\xa1itve\r\n      &copy; 1996-\r\n      2010\r\n
      <a href="http://www.tmedia.biz" target="_blank">(T)media</a></p>\r\n  </div>\r\n</div>\r\n
<div class="ozadje_catez"></div>\r\n
<div class="jsPopupDivFader" id="fader" onClick="javascript:showHide(itemShown);">\r\n
  <table width="100%" height="100%">\r\n    <tr valign="middle">\r\n      <td align="center"></td>\r\n    </tr>\r\n  </table>\r\n</div>\r\n\r\n
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>\r\n
<script type="text/javascript">\r\n_uacct = "UA-1815955-1";\r\nurchinTracker();\r\n</script>\r\n\r\n</body>\r\n</html>\r\n'

As you can see above, in the middle of the HTML there is text like 'Terme
\xc4\x8cate\xc5\xbe' (the original is 'Terme Čatež'). I want to convert
byte sequences like '\xc4\x8c' or '\xc5\xbe' to unaccented characters, so
that 'Terme \xc4\x8cate\xc5\xbe' becomes 'Terme Catez'. Is there a way to
do this conversion for the whole HTML?

Thanks in advance.
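
One possible approach, sketched here and not a definitive answer: the
sequences '\xc4\x8c' and '\xc5\xbe' look like the UTF-8 encodings of 'Č' and
'ž', so the bytes can be decoded as UTF-8 (an assumption about the page's
encoding), decomposed with unicodedata.normalize, and stripped of the
combining accent marks. The helper name strip_accents is hypothetical, and
the sample bytes are copied from the output above so no network access is
needed. The sketch uses Python 3 syntax.

```python
import unicodedata


def strip_accents(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode bytes and drop combining accent marks.

    The default encoding is an assumption; use the page's real charset.
    """
    text = raw_bytes.decode(encoding)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample bytes taken from the HTML dump shown in the message.
sample = b"Terme \xc4\x8cate\xc5\xbe"
result = strip_accents(sample)
print(result)
```

The same function could be applied to the whole txt value. Letters that have
no Unicode decomposition (for example 'đ') pass through unchanged, so they
would need a separate replacement table if plain ASCII is required.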