trying to parse non valid html documents with HTMLParser

Benjamin Niemann pink at odahoda.de
Tue Aug 2 16:14:13 EDT 2005


florent wrote:

> I'm trying to parse html documents from the web, using the HTMLParser
> class of the HTMLParser module (python 2.3), but some web documents are
> not fully valid.

Some?? Most of them :(

> When the parser finds an invalid tag, it raises an
> exception. Then it seems impossible to resume the parsing just after
> where the exception was raised. I'd like to continue parsing an html
> document even if an invalid tag was found. Is it possible to do this?

AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't already) and see which parser is more forgiving.

You might pipe the document through an external tool like HTML Tidy
<http://www.w3.org/People/Raggett/tidy/> before you feed it into
HTMLParser.
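
A rough sketch of that pipeline, in case it helps. Assumptions on my side:
it is written for a current Python 3, where the parser lives in html.parser
(not the Python 2.3 HTMLParser module from the question), a `tidy`
executable is on the PATH, and the names tidy_html, LinkCollector and
extract_links are made up for illustration. If tidy is missing or gives up
on the input, the sketch simply falls back to the raw document.

```python
import shutil
import subprocess
from html.parser import HTMLParser


def tidy_html(raw_html, tidy_cmd="tidy"):
    """Return raw_html cleaned up by HTML Tidy, or unchanged if that fails."""
    if shutil.which(tidy_cmd) is None:
        # Assumption: no tidy installed, so hand back the input untouched.
        return raw_html
    result = subprocess.run(
        [tidy_cmd, "-quiet", "-utf8", "-asxhtml", "--show-warnings", "no"],
        input=raw_html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits non-zero on warnings too, so judge by the output instead:
    # on hard errors it normally writes nothing to stdout.
    cleaned = result.stdout.decode("utf-8", errors="replace")
    return cleaned or raw_html


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(raw_html):
    """Clean the document with tidy first, then feed it to the parser."""
    parser = LinkCollector()
    parser.feed(tidy_html(raw_html))
    parser.close()
    return parser.links


if __name__ == "__main__":
    broken = '<p>see <a href="http://www.odahoda.de/">home</a><p>unclosed'
    print(extract_links(broken))
```

The point of the design is only that the cleanup happens before feed(), so
the parser never sees the broken markup in the first place.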


-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/


