htmllib.py and parsing malformed HTML
nskhcarlso at bellsouth.net
Tue Sep 2 14:15:03 CEST 2003
Thomas Güttler wrote:
> You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
> parse the html.
I appreciate the suggestion, but unfortunately it will not work well
for me because the parser runs as part of a cron job. If there were a
problem, I wouldn't be able to review the tidy error log in a timely
fashion.
What would be really nice is a way to tell the parser that it is
"inside" a <TR> whenever I encounter a <TD> after a closing </TR>.
Browsers still display the HTML correctly without a starting <TR>, but
if the closing </TR> is omitted, everything gets mangled.
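To make the idea concrete, here is a rough sketch of the behaviour I
have in mind. It is only an illustration: it uses html.parser from
Python 3 rather than htmllib, and the names RowRepairingParser and
parse_rows are made up for this example. The parser remembers whether
it is inside a row and pretends it saw a <TR> when a cell turns up
outside one.

```python
from html.parser import HTMLParser


class RowRepairingParser(HTMLParser):
    """Hypothetical sketch: collect table cells row by row, opening an
    implicit row when a <td>/<th> appears without a preceding <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # each row is a list of cell strings
        self.in_row = False
        self.in_cell = False

    def _open_row(self):
        self.rows.append([])
        self.in_row = True
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "tr":
            self._open_row()
        elif tag in ("td", "th"):
            if not self.in_row:
                self._open_row()   # missing <tr>: act as if we saw one
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr":
            self.in_row = False
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def parse_rows(html):
    """Return the table cells found in html as a list of rows."""
    parser = RowRepairingParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Feeding it a table whose second row lacks its opening <TR>, such as
"<TABLE><TR><TD>a</TD><TD>b</TD></TR><TD>c</TD><TD>d</TD></TR></TABLE>",
should still give two rows of two cells each. Nested tables are not
handled here.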
Any other suggestions?