SGMLParser eats ä etc

John J. Lee jjl at
Tue Dec 2 13:19:01 CET 2003

Anders Eriksson <ameLista at> writes:

> On 30 Nov 2003 00:53:28 +0000, John J. Lee wrote:
> > You probably want to use HTMLParser.HTMLParser instead (NOT the same
> > thing as htmllib.HTMLParser, note).  It knows about XHTML, sgmllib &
> > htmllib don't.  
> &aring; etc. isn't XHTML, is it? AFAIK it is defined in HTML 4.

It'll cope with HTML too.  It seems silly to be writing new code now
that will choke on XHTML.

> the strange thing is that the character entity (i.e. &aring;) is stripped
> from the text. I don't want to change it since I'm feeding the output to a
> browser.

Did you read my post?  Read the docs on the stuff in my code snippet.

> I will try the HTMLParser instead but it seems to me that there is a bug in
> SGMLParser...

It's not a bug.  That's just what it does, but you can easily override it.
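
For anyone reading this later, here is a minimal sketch of the kind of override meant above. It is not the snippet from the earlier post. It assumes Python 3, where the HTMLParser module is named html.parser and sgmllib no longer exists. The class name PassThroughParser and the helper passthrough() are made up for illustration. The idea is to override the entity and character reference handlers so the references are written back out unchanged instead of being dropped.

```python
from html.parser import HTMLParser


class PassThroughParser(HTMLParser):
    """Hypothetical example: re-emit markup with references left as written."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref and
        # handle_charref instead of decoding references into the text data.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Empty-element tags such as <br/> are emitted once, unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        # Called with "aring" for "&aring;": write the reference back out.
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        # Called with "229" for "&#229;": write the reference back out.
        self.out.append("&#%s;" % name)

    def result(self):
        return "".join(self.out)


def passthrough(html):
    """Feed html through PassThroughParser and return the re-emitted text."""
    parser = PassThroughParser()
    parser.feed(html)
    parser.close()
    return parser.result()
```

Calling passthrough() on a fragment containing &aring; or &#229; should hand the same references back, so a browser can still render them. Comments, doctype declarations and processing instructions are not handled in this sketch and would be dropped.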

