SGMLParser eats &auml; etc
John J. Lee
jjl at pobox.com
Tue Dec 2 13:19:01 CET 2003
Anders Eriksson <ameLista at telia.com> writes:
> On 30 Nov 2003 00:53:28 +0000, John J. Lee wrote:
> > You probably want to use HTMLParser.HTMLParser instead (NOT the same
> > thing as htmllib.HTMLParser, note). It knows about XHTML, sgmllib &
> > htmllib don't.
> &aring; etc isn't XHTML, is it? AFAIK it is defined in HTML 4.
It'll cope with HTML too. It seems silly to be writing new code now
that will choke on XHTML.
> the strange thing is that the Character entity (i.e. &aring;) is stripped
> from the text. I don't want to change it since I'm feeding the output to a
Did you read my post? Read the docs on the stuff in my code snippet.
> I will try the HTMLParser instead but it seems to me that there is a bug in
It's not a bug. That's just what it does, but you can easily override it.
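
To illustrate the kind of override meant here, a minimal sketch follows. It is
not the snippet from the earlier post. It assumes a current Python 3, where the
old HTMLParser module is spelled html.parser, and the names EntityKeepingParser
and extract_text are made up for the example. The idea is only that the
entity and character-reference handlers are ordinary methods, so a subclass can
put the reference back into the text instead of dropping it.

```python
from html.parser import HTMLParser  # Python 3 spelling of the old HTMLParser module


class EntityKeepingParser(HTMLParser):
    """Hypothetical subclass: collects text and keeps references as written."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref and
        # handle_charref instead of decoding references itself.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Plain text between tags.
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Named reference: put it back unchanged, e.g. "&aring;".
        self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        # Numeric reference: put it back unchanged, e.g. "&#229;".
        self.chunks.append("&#%s;" % name)

    def text(self):
        return "".join(self.chunks)


def extract_text(html):
    """Hypothetical helper: strip tags but leave references in the text."""
    parser = EntityKeepingParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text on a fragment returns its text with the tags removed and
the references still in place. To get the decoded characters instead, the
handlers could look the name up (for example with html.unescape) rather than
re-emitting it.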