Difficulty using htmllib
pan-newsreader at thomas-guettler.de
Mon Jan 20 18:46:36 CET 2003
On Sat, 18 Jan 2003 22:13:37 +0100, Peter Abel wrote:
> joshua at goodish.org (Joshua Goodlett) wrote in message
> news:<e81e0f3d.0301180716.5b1aba2a at posting.google.com>...
>> I've written a small web robot script to familiarize myself with both
>> the htmllib and urllib modules but have been getting an error message
>> when my HTMLParser subclass is fed. I've included the relevant code,
>> as well as the error message, below:
If you look at the top of the HTML file from
http://www.turn-keywireless.com/ you see that it is created
with MS-Frontpage. I once had a similar problem when I
tried to parse the HTML output of MS-Excel.
I think the problem is here: (Quote)
Try parsing the file without it. I work around the problem
by stripping all this MS stuff with a regular expression
before sending the file to the parser.
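
A minimal sketch of that workaround follows. It rests on some
assumptions: the exact markup I stripped is not shown here, so the
pattern below targets Office-style markers such as <![if !vml]> and
<![endif]>, which is only a guess at what "MS stuff" means for a given
page. htmllib no longer exists in Python 3, so html.parser.HTMLParser
stands in for it. The names MS_MARKER, strip_ms_markup, LinkCollector
and collect_links are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: the offending markup looks like <![if ...]> / <![endif]>.
# Adjust the pattern to whatever the generating tool actually emits.
MS_MARKER = re.compile(r"<!\[[^\]]*\]>")


def strip_ms_markup(html):
    """Return html with the assumed MS-style markers removed."""
    return MS_MARKER.sub("", html)


class LinkCollector(HTMLParser):
    """Hypothetical parser subclass that records href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(html):
    """Clean the page first, then feed it to the parser."""
    parser = LinkCollector()
    parser.feed(strip_ms_markup(html))
    parser.close()
    return parser.links
```

The cleaning step is kept separate from the parser on purpose, so
the pattern can be changed or dropped without touching the subclass.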
See (first link)
There is a bug in the bugtracker, too:
Although this bug is closed with the comment:
This is not an actual bug in the interpretation of HTML, and
there has not been a recurring pattern of complaints about
this. Given that we do not want to encourage the creation
of broken HTML, this edge case will not be allowed to
further complicate the code.
I think that htmllib should be fixed.
All volunteers to the keyboard!
Thomas Guettler <guettli at thomas-guettler.de>