Parsing broken HTML via Mozilla

Tom B. sbabbitt at
Tue Aug 10 02:17:01 CEST 2004

"Walter Dörwald" <walter at> wrote in message
news:mailman.1413.1092080863.5135.python-list at
> Hello all!
> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them works 100% reliably. Problems are
> e.g. nested comments, bare "&" in URLs and "<" in text (e.g.
> "if foo < bar") etc.
> All of these pages can be displayed properly in a browser
> so why not reuse the parser in e.g. Mozilla? Is there any
> way to get proper XML out of Mozilla? Calling mozilla on the
> command line would be OK, but it would be better if I could
> use Mozilla like a SAX parser. Is there any project that
> provides this functionality?
> Bye,
>     Walter Dörwald

Maybe you should preprocess your files with something like,
which can help you get rid of the stuff you don't want.
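
To illustrate the preprocessing idea, here is a minimal sketch in plain Python. It is not the tool referred to above; the function name escape_stray_markup and both regular expressions are made up for this example. It only handles two of the problems mentioned (a bare "&" in URLs and a "<" in text) and assumes that a "<" followed by a letter, "/", "!" or "?" really starts markup.

```python
import re

# "&" that does not begin an entity such as &amp; &#38; or &#x26;
# (hypothetical pattern for this sketch)
_BARE_AMP = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")

# "<" that is not followed by something that looks like a tag name,
# a closing tag, a comment/doctype, or a processing instruction
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_markup(html):
    """Escape bare '&' and '<' so a stricter parser is less likely to choke."""
    html = _BARE_AMP.sub("&amp;", html)
    html = _BARE_LT.sub("&lt;", html)
    return html


if __name__ == "__main__":
    sample = '<a href="x?a=1&b=2">if foo < bar</a>'
    print(escape_stray_markup(sample))
```

This will not catch everything: text such as "if a<b" still looks like the start of a tag to the second pattern, and nested comments are not touched at all, so a heuristic like this is only a first pass before handing the result to a real parser.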

