SGMLParser eats &auml; etc

John J. Lee jjl at pobox.com
Sat Nov 29 19:53:28 EST 2003


Anders Eriksson <ameLista at telia.com> writes:
> I'm using sgmllib (ActivePython 2.3.2, build 230) and I have some trouble
> with letters that have been coded, e.g. the letter å is coded &aring;, ä is
> coded &auml; and ö is coded &ouml;, all according to the HTML standard.
> 
> I use the SGMLParser and when I call the feed method all the coded letters
> are stripped/eaten.
> 
> Why?
> How do I fix this?

You probably want to use HTMLParser.HTMLParser instead (note: NOT the same
thing as htmllib.HTMLParser).  It knows about XHTML; sgmllib and htmllib
don't.  If you really want sgmllib, though (untested):

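import sgmllib
import htmlentitydefs

class MyParser(sgmllib.SGMLParser):
    # Maps entity names such as "aring" to their replacement text.
    entitydefs = htmlentitydefs.entitydefs

    def unknown_entityref(self, ref):
        # Called for entity names that are not in entitydefs.
        ...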

    ...
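
For comparison, here is a rough sketch of the HTMLParser route.  It assumes
a current Python 3, where the HTMLParser module is spelled html.parser and
htmlentitydefs is html.entities; the names TextCollector and extract_text
are made up for illustration and are not part of any library.  It only
collects text and turns entity references back into characters, so treat it
as a starting point rather than a finished parser:

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text, decoding entity references."""

    def __init__(self):
        # convert_charrefs=False makes the parser pass references to
        # handle_entityref / handle_charref instead of decoding them itself.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference, e.g. "aring" for &aring;
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: keep the reference instead of dropping it.
            self.parts.append("&%s;" % name)
        else:
            self.parts.append(chr(codepoint))

    def handle_charref(self, name):
        # Numeric reference, e.g. "229" for &#229; or "xe5" for &#xe5;
        if name.lower().startswith("x"):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

    def text(self):
        return "".join(self.parts)


def extract_text(html):
    # Hypothetical convenience wrapper around TextCollector.
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text("<p>&aring; &auml; &ouml;</p>") should give back the
three letters instead of eating them.  Leaving convert_charrefs at its
default of True would make the parser decode the references on its own, so
the two handle_*ref methods are only needed if you want to control the
decoding yourself.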


John



