SGMLParser eats &auml; etc

Peter Hansen peter at engcorp.com
Mon Dec 1 13:16:40 EST 2003


Anders Eriksson wrote:
> 
> On 30 Nov 2003 00:53:28 +0000, John J. Lee wrote:
> 
> > You probably want to use HTMLParser.HTMLParser instead (NOT the same
> > thing as htmllib.HTMLParser, note).  It knows about XHTML, sgmllib &
> > htmllib don't.
> &aring; etc isn't XHTML, is it? AFAIK it is defined in HTML 4.
> 
> The strange thing is that the character entity (i.e. &aring;) is stripped
> from the text. I don't want to change it, since I'm feeding the output to a
> browser.
> 
> I will try the HTMLParser instead but it seems to me that there is a bug in
> SGMLParser...

If it's anything like the expat parser, it munches any undefined character
entity references (i.e. if there's a DTD but no definitions) unless you plug
in an appropriate entity reference subparser.
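
For what it's worth, here is a minimal sketch of the "plug in your own
handler" idea. It assumes a current Python 3, where the HTMLParser class
John mentions lives in html.parser, so it is not the 2003-era module and
says nothing about what sgmllib itself does. The names
EntityPreservingParser and extract_text are made up for illustration.
Each reference is handed to a callback, which writes it back out instead
of dropping or decoding it:

```python
from html.parser import HTMLParser


class EntityPreservingParser(HTMLParser):
    """Hypothetical helper: collects text and keeps entity references."""

    def __init__(self):
        # convert_charrefs=False stops the parser from decoding references
        # itself, so handle_entityref/handle_charref are called instead.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        # Ordinary text between tags and references.
        self.pieces.append(data)

    def handle_entityref(self, name):
        # Named reference such as &aring;: write it back out undecoded.
        # (A terminating semicolon is always emitted.)
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        # Numeric reference such as &#229; or &#xe5;: same treatment.
        self.pieces.append("&#%s;" % name)

    def text(self):
        return "".join(self.pieces)


def extract_text(markup):
    """Return the text content of markup with references left intact."""
    parser = EntityPreservingParser()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Calling extract_text("<p>bl&aring;b&auml;r</p>") should hand back the
text with both references still in place, ready to be sent on to a
browser.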

-Peter



