Parsing broken HTML via Mozilla
sbabbitt at commspeed.net
Tue Aug 10 02:17:01 CEST 2004
"Walter Dörwald" <walter at livinglogic.de> wrote in message
news:mailman.1413.1092080863.5135.python-list at python.org...
> Hello all!
> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them works 100% reliably. Problems are
> e.g. nested comments, bare "&" in URLs and "<" in text (e.g.
> "if foo < bar") etc.
> All of these pages can be displayed properly in a browser
> so why not reuse the parser in e.g. Mozilla? Is there any
> way to get proper XML out of Mozilla? Calling mozilla on the
> command line would be OK, but it would be better if I could
> use Mozilla like a SAX parser. Is there any project that
> provides this functionality?
> Walter Dörwald
Maybe you should preprocess your files with something like,
which can help you get rid of the stuff you don't want.
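
As a rough illustration of what such a preprocessing pass might look like, here is a minimal sketch in Python. It is not the tool referred to above; the function name `preprocess` and both regular expressions are assumptions made up for this example. It only handles two of the problems mentioned (bare "&" and a "<" that does not start a tag), and it does nothing about nested comments.

```python
import re

# Hypothetical pattern: an "&" that does not already begin an entity
# such as &amp; or &#38; or &#x26;
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

# Hypothetical pattern: a "<" that is not followed by a letter, "/", "!"
# or "?", i.e. one that cannot start a tag, comment or declaration
BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def preprocess(html):
    """Escape bare ampersands and stray less-than signs in HTML text."""
    # Ampersands first, so the "&lt;" inserted below is not touched again
    html = BARE_AMP.sub("&amp;", html)
    html = BARE_LT.sub("&lt;", html)
    return html


if __name__ == "__main__":
    sample = '<a href="x?a=1&b=2">if foo < bar</a>'
    print(preprocess(sample))
```

This heuristic will miss cases such as "if foo <bar" (no space after the "<"), so it is only a first cleanup step before handing the result to a real parser.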