Parsing broken HTML via Mozilla

Walter Dörwald walter at
Wed Aug 11 12:09:01 CEST 2004

Paul Wright wrote:

> In article <mailman.1413.1092080863.5135.python-list at>, Walter
> Dörwald wrote:
>>I'm trying to parse broken HTML with several Python tools.
>>Unfortunately none of them works 100% reliably. Problems are e.g.
>>nested comments, bare "&" in URLs and "<" in text (e.g. "if foo <
>>bar") etc.
> Not a Mozilla solution, but I hear good things about

I already tried that, but it completely ignores encoding issues
and passes broken entity references (e.g. a bare & in URLs) through
literally. Furthermore, its support for DTD-aware HTML parsing
is incomplete (e.g. <link> is not handled as an empty tag).
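
To make two of these complaints concrete, here is a minimal sketch of
the behaviour I would like to see. It assumes the standard library's
html.parser module (Python 3), which is not one of the tools discussed
in this thread. The names EMPTY_TAGS, LenientCollector and collect are
made up for this example, and the sketch does not address the encoding
problem at all.

```python
from html.parser import HTMLParser

# Elements that the HTML 4 DTD declares as EMPTY (they never get an end tag).
EMPTY_TAGS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class LenientCollector(HTMLParser):
    """Hypothetical helper: records hrefs, text and the open-element stack."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hrefs = []      # href attribute values, in document order
        self.text = []       # text chunks as the parser reports them
        self.open_tags = []  # elements that are still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is not None:
            self.hrefs.append(href)
        # DTD awareness: an empty element is never pushed on the stack.
        if tag not in EMPTY_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything opened
        # since the matching start tag, including that tag itself.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        self.text.append(data)


def collect(markup):
    """Parse a complete markup string and return the filled collector."""
    parser = LenientCollector()
    parser.feed(markup)
    parser.close()
    return parser
```

Fed with something like

    <head><link rel="stylesheet" href="s.css"></head>
    <p><a href="x.cgi?a=1&b=2">if foo < bar</a></p>

the collector should report the second href with its bare & intact,
keep the "<" in the text, and end with an empty open_tags list, because
<link> was never treated as an open element.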

    Walter Dörwald

More information about the Python-list mailing list