Parsing broken HTML via Mozilla

Walter Dörwald walter at livinglogic.de
Wed Aug 11 12:09:01 CEST 2004


Paul Wright wrote:

> In article <mailman.1413.1092080863.5135.python-list at python.org>, Walter
> Dörwald wrote:
> 
>>I'm trying to parse broken HTML with several Python tools.
>>Unfortunately none of them works 100% reliably. Problems are e.g.
>>nested comments, bare "&" in URLs and "<" in text (e.g. "if foo <
>>bar") etc.
> 
> Not a Mozilla solution, but I hear good things about
> http://www.crummy.com/software/BeautifulSoup/

I already tried that, but it completely ignores encoding issues
and passes broken entity references (e.g. a bare & in URLs) along
literally. Furthermore, its support for DTD-aware HTML parsing
is incomplete (e.g. <link> is not handled as an empty tag).
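
To make the two complaints concrete, here is a minimal sketch of the
kind of repair meant above. It is only an illustration under stated
assumptions, not what BeautifulSoup or Mozilla actually do: the names
BARE_AMP, escape_bare_ampersands, EMPTY_ELEMENTS and is_empty_element
are hypothetical, and the element list is taken from the HTML 4.01 DTD.

```python
import re

# Hypothetical pattern: an "&" that does NOT start a syntactically
# well-formed reference (named: &amp;  decimal: &#38;  hex: &#x26;).
BARE_AMP = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)

def escape_bare_ampersands(text):
    """Replace bare "&" with "&amp;", leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)

# Elements declared EMPTY in the HTML 4.01 DTD (assumed reference DTD).
EMPTY_ELEMENTS = frozenset([
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
])

def is_empty_element(name):
    """Return True if the tag never has content or an end tag."""
    return name.lower() in EMPTY_ELEMENTS
```

Note that the pattern only checks syntax: an unknown name such as
"&foo;" is left untouched, and a reference written without the
trailing semicolon (e.g. "&copy") gets escaped, so a real parser
would need an entity table on top of this.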

Bye,
    Walter Dörwald




