HTMLParser ignores unicode entities
Fredrik Lundh
fredrik at pythonware.com
Tue Dec 17 11:56:46 EST 2002
Thomas Guettler wrote:
> MS-Excel exports Cyrillic characters encoded as entities.
>
> Example:
> &#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;
>
> HTMLParser ignores these entities.
if you're using htmllib.HTMLParser, the parser calls the handle_charref
method for all charrefs. the default implementation calls unknown_charref
for charrefs outside the ISO-8859-1 range. to handle other charrefs,
override unknown_charref in a subclass:
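
    import formatter
    from htmllib import HTMLParser

    class MyParser(HTMLParser):
        def unknown_charref(self, ref):
            # ref is the numeric part of the reference, e.g. "1054"
            print "CHARREF", ref

    # htmllib's parser needs a formatter; NullFormatter discards all output
    p = MyParser(formatter.NullFormatter())
    p.feed("&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;")
    p.close()
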
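the handler above only prints the reference. as a rough sketch of what to do with it, the helper below turns a reference body such as "1054" (or the hexadecimal form "x41E") into the corresponding character. charref_to_text is a made-up name, not part of htmllib, and the code is written for Python 3, where chr() accepts any code point; under Python 2 you would use unichr() instead.

```python
def charref_to_text(ref):
    """Convert the body of a numeric character reference to a character.

    Hypothetical helper: `ref` is the text between "&#" and ";",
    for example "1054" (decimal) or "x41E" (hexadecimal).
    """
    if ref[:1] in ("x", "X"):
        codepoint = int(ref[1:], 16)   # hexadecimal form
    else:
        codepoint = int(ref, 10)       # decimal form
    return chr(codepoint)


# the eight references from the example in this thread
refs = ["1054", "1090", "1087", "1072", "1076", "1098", "1094", "1080"]
word = "".join(charref_to_text(r) for r in refs)
print(word)
```

a malformed reference makes int() raise ValueError, so a real handler would probably want to catch that and fall back to the raw text.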
if you're using HTMLParser.HTMLParser, the parser calls the handle_charref
method for all charrefs. the default implementation does nothing, so
override handle_charref in a subclass:
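
    from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        def handle_charref(self, ref):
            # ref is the numeric part of the reference, e.g. "1054"
            print "CHARREF", ref

    p = MyParser()
    p.feed("&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;")
    p.close()
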
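for comparison, here is a sketch of the same approach on Python 3, where the module is named html.parser. it assumes you pass convert_charrefs=False, since otherwise the parser decodes the references itself and never calls handle_charref. TextCollector and its parts attribute are illustrative names only, and the sample markup is invented.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Illustrative subclass: gathers text, decoding numeric charrefs."""

    def __init__(self):
        # convert_charrefs=False so that handle_charref is actually called
        HTMLParser.__init__(self, convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # ordinary text between tags
        self.parts.append(data)

    def handle_charref(self, ref):
        # ref is "1054" for &#1054; and "x41E" for &#x41E;
        if ref[:1] in ("x", "X"):
            self.parts.append(chr(int(ref[1:], 16)))
        else:
            self.parts.append(chr(int(ref, 10)))


p = TextCollector()
p.feed("<td>&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080; 2002</td>")
p.close()
text = "".join(p.parts)
print(text)
```

collecting the pieces in a list and joining at the end keeps the handlers simple, since the parser may deliver text in several chunks.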
see the library reference for more information.
</F>
<!-- (the eff-bot guide to) the python standard library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->