Unexpected behaviour with HTMLParser...
Just Another Victim of the Ambient Morality
ihatespam at hotmail.com
Tue Oct 9 23:07:45 CEST 2007
HTMLParser is behaving in what I find to be strange ways, and I would
like to better understand what it is doing and why.
First, it doesn't appear to translate HTML escape characters. I don't
know the actual terminology but things like &amp; don't get translated into
& as one would like. Furthermore, not only does HTMLParser not translate it
properly, it seems to omit it altogether! This prevents me from even doing
the translation myself, so I can't even work around the issue.
Why is it doing this? Is there some mode I need to set? Can anyone
else duplicate this behaviour? Is it a bug?
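For anyone trying to duplicate this, here is a minimal sketch of the kind of test that shows it. It is written against Python 3's html.parser, and passing convert_charrefs=False is an assumption on my part: it makes the parser report references separately, which seems to match what the older HTMLParser module does. The Recorder class and record() helper are made-up names for illustration only.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: records callbacks so their order is visible."""

    def __init__(self):
        # Assumption: convert_charrefs=False makes Python 3 report
        # references separately instead of folding them into the text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &amp; (name == "amp").
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#60; (name == "60").
        self.events.append(("charref", name))


def record(markup):
    """Feed markup to a fresh Recorder and return the recorded events."""
    parser = Recorder()
    parser.feed(markup)
    parser.close()
    return parser.events


for event in record("<p>fish &amp; chips</p>"):
    print(event)
```

In this sketch the &amp; never reaches handle_data(); it shows up as a separate handle_entityref("amp") call between two text chunks, which would look like an omission if only handle_data() is overridden.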
Secondly, HTMLParser often calls handle_data() consecutively, without
any calls to handle_starttag() in between. I did not expect this. In HTML,
you either have text or you have tags. Why split up my text into successive
handle_data() calls? This makes no sense to me. At the very least, it does
this in response to text with &amp;-like escape sequences (or whatever
they're called), so that it may successively avoid those translations.
Again, why is it doing this? Is there some mode I need to set? Can
anyone else duplicate this behaviour? Is it a bug?
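If the splitting turns out to be intended, one possible way to get whole text runs back is to buffer the pieces until the next tag. This is only a sketch under assumptions: it uses Python 3's html.parser with convert_charrefs=False to mimic the splitting described above, it uses html.unescape() to translate the references, and TextJoiner and text_runs() are names I invented for the example.

```python
from html import unescape
from html.parser import HTMLParser


class TextJoiner(HTMLParser):
    """Hypothetical helper: joins consecutive text chunks between tags."""

    def __init__(self):
        # Assumption: convert_charrefs=False reproduces the split callbacks.
        super().__init__(convert_charrefs=False)
        self.runs = []      # completed text runs, one per stretch between tags
        self._pending = []  # chunks collected since the last tag

    def _flush(self):
        # Emit everything collected so far as a single string.
        if self._pending:
            self.runs.append("".join(self._pending))
            self._pending = []

    def handle_data(self, data):
        self._pending.append(data)

    def handle_entityref(self, name):
        # Translate a named reference, e.g. "amp" -> "&".
        self._pending.append(unescape("&%s;" % name))

    def handle_charref(self, name):
        # Translate a numeric reference, e.g. "60" -> "<".
        self._pending.append(unescape("&#%s;" % name))

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def close(self):
        super().close()
        self._flush()


def text_runs(markup):
    """Return the text between tags, one string per uninterrupted run."""
    joiner = TextJoiner()
    joiner.feed(markup)
    joiner.close()
    return joiner.runs


print(text_runs("<p>fish &amp; chips</p><p>a &#60; b</p>"))
```

The choice to flush on every start and end tag is just one option; flushing only on particular tags would work the same way.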
These are serious problems for me and I would greatly appreciate a
deeper understanding of these issues.