Over at http://bugzilla.gnome.org/show_bug.cgi?id=569131 I reported what I thought was a bug in HTMLParser, but which on closer inspection appears to be an incorrect assumption on my part (and that of lxml) when dealing with errors returned by the push parser interface.

With the libxml2 bindings, I am able to parse invalid HTML using the push parser:

>>> import libxml2
>>> options = libxml2.HTML_PARSE_RECOVER | libxml2.HTML_PARSE_NONET
>>> p = libxml2.htmlCreatePushParser(None, "", 0, "test")
>>> p.ctxtUseOptions(options)
0
>>> bad1 = '''<p><pre></pre></p>\n'''
>>> p.htmlParseChunk(bad1, len(bad1), 0)
test:1: HTML parser error : Unexpected end tag : p
<p><pre></pre></p>
                  ^
76
>>> good = '''<div>foo</div>\n'''
>>> p.htmlParseChunk(good, len(good), 0)
76
>>> p.htmlParseChunk("", 0, 1)
76
>>> print p.doc().serialize()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p></p><pre></pre>
<div>foo</div></body></html>

But with lxml, the parser is reset on encountering an error:

>>> from lxml.etree import HTMLParser, dump
>>> p = HTMLParser(recover=True)
>>> bad1 = '''<p><pre></pre></p>\n'''
>>> p.feed(bad1)
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "parser.pxi", line 1093, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:61114)
  File "parser.pxi", line 534, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:56605)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57504)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56902)
XMLSyntaxError: Unexpected end tag : p, line 1, column 19
>>> good = '''<div>foo</div>\n'''
>>> p.feed(good)
>>> elem = p.close()

And the previous state is lost:

>>> dump(elem)
<html>
  <body>
    <div>foo</div>
  </body>
</html>

In fact, I'm unable to retrieve any state from the parser unless it is reset:

>>> p.feed(bad1)
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "parser.pxi", line 1093, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:61114)
  File "parser.pxi", line 534, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:56605)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57504)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56902)
XMLSyntaxError: Unexpected end tag : p, line 1, column 19
>>> p.close()
Traceback (most recent call last):
  File "<console>", line 1, in ?
  File "parser.pxi", line 1113, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:61239)
XMLSyntaxError: no element found

So in my view, the behaviour here is not helpful. When a parser is created with recover=True it should not raise errors, thereby allowing incremental parsing of invalid HTML.

Laurence
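
P.S. Until this is resolved, one possible stopgap is to give up on truly incremental parsing: buffer the chunks and hand the whole document to a recovering parser only at close(). The following is a minimal sketch, assuming the document fits in memory. BufferedFeed and lxml_recovering_parse are hypothetical names of mine, not part of lxml, and I am assuming that one-shot parsing with HTMLParser(recover=True) does not raise on this kind of input.

```python
class BufferedFeed(object):
    """Hypothetical helper: collect chunks, parse them in one go at close()."""

    def __init__(self, parse):
        # `parse` is any callable that takes the whole document as a string.
        self._parse = parse
        self._chunks = []

    def feed(self, data):
        # Nothing is parsed here, so a bad chunk cannot raise mid-stream
        # and discard what was fed before it.
        self._chunks.append(data)

    def close(self):
        document = "".join(self._chunks)
        self._chunks = []
        return self._parse(document)


def lxml_recovering_parse(document):
    # Assumption: one-shot parsing with recover=True tolerates the invalid
    # markup that trips up feed(). Imported lazily so BufferedFeed itself
    # has no hard dependency on lxml.
    from lxml import etree
    return etree.fromstring(document, etree.HTMLParser(recover=True))
```

Usage would be p = BufferedFeed(lxml_recovering_parse), followed by the same p.feed(bad1), p.feed(good) and p.close() calls as above. The obvious cost is that nothing is parsed until close(), so this does not replace a push parser that recovers properly.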