Parsing broken HTML via Mozilla
walter at livinglogic.de
Mon Aug 9 21:47:40 CEST 2004
I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems
include nested comments, bare "&" in URLs, and "<" in text
(e.g. "if foo < bar").
All of these pages can be displayed properly in a browser,
so why not reuse the parser from e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling Mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?
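
To make the kind of interface I mean concrete, here is a minimal sketch of
SAX-style event handling on broken input. It does not use Mozilla at all: it
assumes the html.parser module from a present-day Python standard library,
and the names EventCollector and collect_events are made up for illustration.
The sample string combines two of the problems listed above (a bare "&" in a
URL and a "<" in text).

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records start tags and text as they are reported."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []    # (tag name, attribute list) pairs
        self.chunks = []  # text may arrive in several pieces

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

    def handle_data(self, data):
        self.chunks.append(data)


def collect_events(html):
    """Feed a (possibly broken) HTML string and return (tags, joined text)."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.tags, "".join(parser.chunks)


if __name__ == "__main__":
    # Illustrative input only: bare "&" in the URL, stray "<" in the text.
    tags, text = collect_events('<a href="page?a=1&b=2">if foo < bar</a>')
    print(tags)
    print(text)
```

The text is joined because the parser may report it in several chunks. This
only shows the callback style; whether such a parser recovers from every
broken page the way a browser does is a separate question.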