exhaustive mapping from html entities to unicode ?

shagshag13 shagshag13 at yahoo.fr
Fri Mar 7 16:38:35 CET 2003


hello,

in fact my original post was motivated by this: i want to strip html from
web pages, but i would also like to convert html entities to their readable
characters, and i think i am messing everything up...

for example: why can i type the following at the dos prompt, while idle
gives me a "UnicodeError: ASCII encoding error: ordinal not in range(128)"?

dos = {'Œ' : 'O', 'œ' : 'o', 'Š' : 'S', 'š' : 's',
       'Ÿ' : 'Y',
       'ˆ' : '^', '˜' : '~',
       '–' : '-', '—' : '-', '‘' : "'", '’' : "'",
       '‚' : ',', '“' : '"', '”' : '"',
       '„' : '"', '†' : '?', '‡' : '?', '‰' : '?',
       '‹' : '<', '›' : '>', '€' : '?',
       'ƒ' : 'f',
       '•' : '.',
       '™' : 'T' }
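
to make the goal concrete, here is a minimal sketch of the kind of conversion
i have in mind. it assumes a recent python 3, where html.unescape is
available; ASCII_FALLBACK and to_readable are names made up for this
illustration, and the table is only a small subset of the one above. the
keys are written as \u escapes so the source file itself stays pure ascii.

```python
import html

# hypothetical fallback table: a few non-ascii characters mapped to ascii
ASCII_FALLBACK = {
    '\u0152': 'O',   # latin capital ligature oe
    '\u0153': 'o',   # latin small ligature oe
    '\u2013': '-',   # en dash
    '\u2014': '-',   # em dash
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
}

def to_readable(text, table=ASCII_FALLBACK):
    """Decode html entities, then replace characters found in the table."""
    decoded = html.unescape(text)  # handles both &name; and &#NNN; forms
    # characters missing from the table are passed through unchanged
    return ''.join(table.get(ch, ch) for ch in decoded)
```

with this, to_readable('&#338;uvre &#8211; &quot;ok&quot;') would give
'Ouvre - "ok"'; anything not listed in the table is left as it is.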

and if i write these as unicode strings, i won't be able to print them at
the dos prompt...

also, is there a way other than checking <?xml version="1.0"
encoding="iso-8859-1"?> to know the encoding of a downloaded web page?
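
for reference, the check i mean looks roughly like the sketch below. it
assumes python 3 and that the page is available as raw bytes;
declared_encoding and _XML_DECL are hypothetical names, and the function only
covers the xml declaration, nothing else.

```python
import re

# matches encoding="..." or encoding='...' inside an xml declaration
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def declared_encoding(page_bytes, default=None):
    """Return the encoding named in the xml declaration, or default."""
    # the declaration has to sit at the very top, so a short prefix is enough
    match = _XML_DECL.search(page_bytes[:1024])
    if match is None:
        return default
    return match.group(1).decode('ascii').lower()
```

a page without such a declaration simply yields the default, which is
exactly the case i don't know how to handle.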

thanks,

s13.





