HTMLParser ignores unicode entities

Fredrik Lundh fredrik at pythonware.com
Tue Dec 17 11:56:46 EST 2002


Thomas Guettler wrote:

> MS-Excel exports Cyrillic characters encoded as entities.
>
> Example:
>   &#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;
>
> HTMLParser ignores these entities.

if you're using htmllib.HTMLParser, the parser calls the handle_charref
method for all charrefs.  the default implementation calls unknown_charref
for charrefs outside the ISO-8859-1 range (character codes above 255).  to
handle other charrefs, override unknown_charref in a subclass.

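    from htmllib import HTMLParser
    from formatter import NullFormatter

    class MyParser(HTMLParser):
        # called for charrefs the default handler cannot map
        def unknown_charref(self, ref):
            print "CHARREF", ref

    # htmllib's parser needs a formatter; NullFormatter discards output
    p = MyParser(NullFormatter())
    p.feed("&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;")
    p.close()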
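
to get an actual character instead of a printout, the handler has to turn
the reference name into a code point first.  a minimal sketch of that step
follows; charref_to_text is a made-up name, the hex branch is only there in
case the parser passes names like "x41E", and returning the reference
unchanged on bad input is just one possible choice.

```python
import sys

# unichr only exists on Python 2; on Python 3, chr covers all of Unicode
if sys.version_info[0] >= 3:
    unichr = chr

def charref_to_text(ref):
    """hypothetical helper: map a charref name ("1054", "x41E") to a character"""
    try:
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)   # hexadecimal form, &#x41E;
        else:
            code = int(ref)           # decimal form, &#1054;
        return unichr(code)
    except (ValueError, OverflowError):
        # not a number, or not a valid code point: give the reference back
        return "&#" + ref + ";"

# the names a parser would report for the example from the question
refs = ["1054", "1090", "1087", "1072", "1076", "1098", "1094", "1080"]
word = u"".join([charref_to_text(r) for r in refs])
```

a handler such as unknown_charref could call this and pass the result on to
wherever the rest of the text is collected.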

if you're using HTMLParser.HTMLParser, the parser calls the handle_charref
method for all charrefs.

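    from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        # called with the reference name, e.g. "1054" for &#1054;
        def handle_charref(self, ref):
            print "CHARREF", ref

    p = MyParser()
    p.feed("&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;")
    p.close()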
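
for readers on Python 3 (3.4 or later is assumed here), the same class
lives in html.parser, and with convert_charrefs=True the parser decodes
numeric references itself and hands the result to handle_data, so
handle_charref is not involved.  a rough sketch under that assumption;
TextCollector is a made-up name.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """hypothetical helper: gathers the text found between tags"""

    def __init__(self):
        # convert_charrefs=True asks the parser to decode &#NNNN; itself
        HTMLParser.__init__(self, convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # data arrives with character references already decoded
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

p = TextCollector()
p.feed("<p>&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;</p>")
p.close()
decoded = p.text()  # the eight Cyrillic letters as one str
```

the text may arrive in more than one handle_data call, which is why the
sketch collects chunks and joins them at the end.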

see the library reference for more information.

</F>

<!-- (the eff-bot guide to) the python standard library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->





