exhaustive mapping from html entities to unicode ?

Fri Mar 7 10:38:35 EST 2003

hello,

in fact my original post was motivated by this: i want to strip html
from web pages, but i would like to convert html entities to their
readable characters, and i think i am messing everything up...

for example: why can i type the following at the dos prompt, while idle
gives me a "UnicodeError: ASCII encoding error: ordinal not in range(128)"?

dos = {'Œ' : 'O', 'œ' : 'o', 'Š' : 'S', 'š' : 's',
'Ÿ' : 'Y',
'ˆ' : '^', '˜' : '~',
'–' : '-', '—' : '-', '‘' : "'", '’' : "'",
'‚' : ',', '“' : '"', '”' : '"',
'„' : '"', '†' : '?', '‡' : '?', '‰' : '?',
'‹' : '<', '›' : '>', '€' : '?',
'ƒ' : 'f',
'•' : '.',
'™' : 'T' }

and if i write these as unicode strings, i won't be able to print them at
the dos prompt...
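
to make the goal concrete, here is a minimal sketch of what i am after. it assumes a recent python 3, where the standard `html` module provides `html.unescape`; the names `FALLBACK`, `to_text` and `printable` are made up for illustration, and the table is only a subset of the one above. the idea is to decode entities to unicode first, and to fall back to ascii look-alikes only for characters the output encoding cannot represent:

```python
import html

# Hypothetical fallback table: ASCII look-alikes for a few characters
# (a subset of the dictionary in this post).
FALLBACK = {
    "\u0152": "O",   # latin capital ligature OE
    "\u0153": "o",   # latin small ligature oe
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2122": "T",   # trade mark sign
}


def to_text(markup_text):
    """Replace named and numeric HTML entities with unicode characters."""
    return html.unescape(markup_text)


def printable(text, encoding="ascii"):
    """Return text restricted to what `encoding` can represent."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Use the look-alike if there is one, otherwise a question mark.
            out.append(FALLBACK.get(ch, "?"))
    return "".join(out)


decoded = to_text("&OElig;uvre &mdash; &ldquo;caf&eacute;&rdquo; &#8364;5")
print(printable(decoded, "ascii"))
```

with an encoding such as cp1252, which can represent all of these characters, `printable` should return the decoded text unchanged, so the fallback only kicks in where it is needed.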

also, is there a way, other than checking for <?xml version="1.0"
encoding="iso-8859-1"?>, to find out the encoding of a downloaded web page?
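
one approach i could imagine, sketched below for a recent python 3, is to look in several places in turn: the charset parameter of the http Content-Type header, a utf-8 byte order mark, the xml declaration, and a <meta> tag. the function name `guess_encoding`, the 1024-byte window and the iso-8859-1 default are all assumptions of mine, not an established recipe, and a declared encoding can still be wrong:

```python
import codecs
import re

# Patterns are applied to the first bytes of the page, which are assumed
# to be ASCII-compatible enough for the declaration itself to be readable.
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)', re.I)
HEADER_CHARSET = re.compile(r'charset=["\']?([A-Za-z0-9._-]+)', re.I)


def guess_encoding(body, content_type=None, default="iso-8859-1"):
    """Guess a page's encoding from the header, a BOM, or in-page hints.

    `body` is the raw page as bytes; `content_type` is the value of the
    HTTP Content-Type header, if the caller has it.
    """
    # 1. charset parameter of the HTTP Content-Type header
    if content_type:
        m = HEADER_CHARSET.search(content_type)
        if m:
            return m.group(1).lower()
    # 2. UTF-8 byte order mark at the very start of the page
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 3. XML declaration, then <meta ... charset=...>, near the top
    head = body[:1024]
    for pattern in (XML_DECL, META_CHARSET):
        m = pattern.search(head)
        if m:
            return m.group(1).decode("ascii").lower()
    # 4. Nothing declared: fall back to an assumed default
    return default


print(guess_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><html/>'))
print(guess_encoding(b"<html></html>", "text/html; charset=UTF-8"))
```

the function works on bytes and a header string only, so it can be tried without any network access.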

thanks,

s13.