Parsing broken HTML via Mozilla
walter at livinglogic.de
Wed Aug 11 12:09:01 CEST 2004
Paul Wright wrote:
> In article <mailman.1413.1092080863.5135.python-list at python.org>, Walter
> Dörwald wrote:
>>I'm trying to parse broken HTML with several Python tools.
>>Unfortunately none of them works 100% reliably. Problems are e.g.
>>nested comments, bare "&" in URLs and "<" in text (e.g. "if foo <
> Not a Mozilla solution, but I hear good things about
I already tried that, but it completely ignores encoding issues
and passes broken entity references (e.g. a bare & in URLs) through
literally. Furthermore, its support for DTD-aware HTML parsing
is incomplete (e.g. <link> is not handled as an empty tag).
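
To make two of these complaints concrete, here is a minimal sketch of what
handling them could look like. It is not the tool discussed in this thread:
it assumes the html.parser module from the Python 3 standard library, and
the names escape_bare_ampersands, OutlineParser and outline are hypothetical,
chosen only for illustration. The set of empty elements is the one declared
EMPTY in the HTML 4.01 DTD.

```python
import re
from html.parser import HTMLParser

# Elements the HTML 4.01 DTD declares EMPTY (no content, no end tag).
VOID_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}

# An "&" that does not begin a syntactically complete named, decimal,
# or hexadecimal character reference. Entity names are not validated.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Replace bare ampersands with &amp; and leave references alone."""
    return BARE_AMP.sub("&amp;", text)


class OutlineParser(HTMLParser):
    """Hypothetical helper: records tag events, treating void
    elements such as <link> as empty instead of leaving them open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open non-void elements
        self.events = []  # recorded (kind, tag, ...) tuples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            # Never pushed on the stack, so it cannot swallow siblings.
            self.events.append(("empty", tag, dict(attrs)))
        else:
            self.stack.append(tag)
            self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Ignore stray end tags for void or never-opened elements.
        if tag in VOID_ELEMENTS or tag not in self.stack:
            return
        # Close any elements left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.events.append(("end", open_tag))
            if open_tag == tag:
                break


def outline(markup):
    """Parse markup and return the list of recorded tag events."""
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.events
```

With this sketch, escape_bare_ampersands("a.cgi?x=1&y=2") gives
"a.cgi?x=1&amp;y=2", and outline() reports <link> as a single "empty"
event rather than as an element that stays open until </head>. The
regular expression only checks the syntax of a reference, so something
like "&foo;" is left untouched, and the encoding problem mentioned
above is not addressed at all.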