HTMLParser ignores unicode entities
Thomas Guettler
zopestoller at thomas-guettler.de
Tue Dec 17 07:31:09 EST 2002
Hi!
MS-Excel exports Cyrillic characters encoded as numeric entities.
Example:
&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;
HTMLParser ignores these entities.
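To illustrate what these entities mean: each one is a numeric character reference whose number is the Unicode code point of the character. The following is a minimal sketch, not taken from the archives, written for Python 3 (where `chr` replaces `unichr`). The name `decode_numeric_refs` is hypothetical, and the function works on a plain string without parsing any HTML.

```python
import re

# Matches decimal (&#1054;) and hexadecimal (&#x41E;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def decode_numeric_refs(text):
    """Replace numeric character references in text with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference untouched.
            return match.group(0)
    return _NUMERIC_REF.sub(replace, text)

print(decode_numeric_refs("&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;"))
```

Named entities such as `&amp;` are deliberately left alone here; they still need a lookup table.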
In the archives I found the following solution:
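import htmllib

def get_html_entities():
    import htmlentitydefs
    # Map every named entity to its unicode character.
    myentitydefs = htmlentitydefs.entitydefs.copy()
    for k, v in myentitydefs.items():
        #print "in myentities:", k, v
        if v.startswith('&#'):
            # Values such as '&#913;' hold a decimal code point.
            v = int(v[2:-1])
        else:
            # All other values are single Latin-1 characters.
            v = ord(v)
        myentitydefs[k] = unichr(v)
    return myentitydefs

class MSExcelHTMLParser(htmllib.HTMLParser):
    entitydefs = get_html_entities()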
This only works for the named HTML entities.
I could add the entities for all unicode characters,
but there are a lot. I don't think that's the best
solution.
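One alternative to enumerating every character is to convert the number in each reference directly. The sketch below is an assumption about how that could look, not a tested answer for `htmllib`: it uses the Python 3 `html.parser` module, and the names `CharRefParser` and `extract_text` are hypothetical. With `convert_charrefs=False` the parser calls `handle_charref` for each numeric reference and `handle_entityref` for each named one.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser

class CharRefParser(HTMLParser):
    """Collects document text, decoding numeric and named references itself."""

    def __init__(self):
        # convert_charrefs=False makes the parser call the handle_*ref hooks.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, name):
        # name is "1054" for &#1054; and "x41E" for &#x41E;
        if name[0] in "xX":
            code_point = int(name[1:], 16)
        else:
            code_point = int(name)
        self.chunks.append(chr(code_point))

    def handle_entityref(self, name):
        code_point = name2codepoint.get(name)
        if code_point is None:
            # Unknown entity name: keep the reference as written.
            self.chunks.append("&%s;" % name)
        else:
            self.chunks.append(chr(code_point))

def extract_text(html):
    parser = CharRefParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(extract_text("<td>&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;</td>"))
```

No table of Cyrillic entities is needed this way, because the code point is read from the reference itself.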
Does someone know how I can parse HTML files containing
unicode entities?
thomas