Good HTML Parser

Diez B. Roggisch deets at nospam.web.de
Thu Jul 17 07:31:13 EDT 2008


Chris wrote:

> Can anyone recommend a good HTML/XHTML parser, similar to
> HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
> know that certain tags, like <br>, are implicitly closed? I need to
> iterate through the entire DOM, building up a DOM path, but the stdlib
> parsers aren't calling handle_endtag() for any implicitly closed tags.
> I looked at BeautifulSoup, but it only seems to work by first parsing
> the entire document, then allowing you to query the document
> afterwards. I need something like a SAX parser.

This isn't possible. Your own example of implicitly closed tags needs
context that a purely SAX-like parser can't provide.

I suggest you use BeautifulSoup and, if you must, build your own
event generation around it that you can attach consumers to.
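
A rough sketch of what such event generation could look like, assuming
the current bs4 package and its "html.parser" builder. The names
walk_events and events_from_html and the (event, payload, path) tuple
layout are made up for illustration and are not part of BeautifulSoup.
The walker only relies on nodes having .name, .attrs and .children, and
treats anything without a name as text:

```python
def walk_events(node, path=()):
    """Yield SAX-like (event, payload, path) tuples for a parsed tree.

    Duck-typed: element nodes need .name, .attrs and .children (as
    BeautifulSoup's Tag has); anything without a name is treated as text.
    """
    name = getattr(node, "name", None)
    if name is None:
        # Text node (with BeautifulSoup this also covers comments).
        yield ("text", str(node), path)
        return
    here = path + (name,)  # DOM path from the root down to this element
    yield ("start", (name, dict(node.attrs)), here)
    for child in node.children:
        yield from walk_events(child, here)
    # Emitted for every element, including ones the source never closed.
    yield ("end", name, here)


def events_from_html(html):
    """Hypothetical helper: parse with BeautifulSoup, then replay events."""
    # Imported lazily so walk_events itself has no third-party dependency.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    for child in soup.children:
        yield from walk_events(child)
```

A consumer would then just loop over events_from_html(document) and
dispatch on the first item of each tuple. Because the tree is complete
before any event is produced, every "start" is matched by an "end", and
the path tuple gives the DOM path without the consumer having to track
it. The cost is the one Chris wanted to avoid: the whole document is
parsed up front.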

Diez


