HTMLParser ignores unicode entities

Fredrik Lundh fredrik at
Tue Dec 17 11:56:46 EST 2002

Thomas Guettler wrote:

> MS-Excel exports Cyrillic characters encoded as entities.
> Example:
>   &#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;
> HTMLParser ignores these entities.

if you're using htmllib.HTMLParser, the parser calls the handle_charref
method for all charrefs.  the default implementation calls unknown_charref
for charrefs outside the ISO-8859-1 range.  to handle other charrefs,
override unknown_charref in a subclass.

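    from formatter import NullFormatter
    from htmllib import HTMLParser

    class MyParser(HTMLParser):
        # gets the charref text (e.g. "1054") that the default handler rejected
        def unknown_charref(self, ref):
            print "CHARREF", ref

    # htmllib parsers need a formatter; NullFormatter discards all output
    p = MyParser(NullFormatter())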

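if you're using HTMLParser.HTMLParser, the parser calls the handle_charref
method for all charrefs, passing the text between "&#" and ";" (e.g. "1054")
as a string.  the default implementation does nothing, so override it in a
subclass.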

    from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        def handle_charref(self, ref):
            print "CHARREF", ref

    p = MyParser()
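
for readers on Python 3, where the HTMLParser module was renamed to
html.parser, a minimal sketch of the same idea might look like the code
below.  CharrefCollector, parts and text() are illustrative names, not part
of the library.  the sketch assumes convert_charrefs=False is passed, since
Python 3.5 and later otherwise decode charrefs themselves and never call
handle_charref.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class CharrefCollector(HTMLParser):
    """Hypothetical subclass that decodes numeric charrefs into text."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref in play; with the
        # default (True) the parser decodes charrefs before handle_data.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # ordinary text between tags
        self.parts.append(data)

    def handle_charref(self, ref):
        # ref is the text between "&#" and ";", e.g. "1054" or "x41E"
        if ref[:1] in ("x", "X"):
            codepoint = int(ref[1:], 16)
        else:
            codepoint = int(ref)
        self.parts.append(chr(codepoint))

    def text(self):
        return "".join(self.parts)


p = CharrefCollector()
p.feed("<td>&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;</td>")
p.close()
print(p.text())
```

collecting the decoded characters in a list and joining them at the end
keeps data and charrefs in document order, which matters when a cell mixes
plain text with references.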

see the library reference for more information.

