[lxml-dev] HTMLParser behavior - ill formed HTML
Hi all,

My name is Dean and I've recently joined lxml-dev to watch how things are going, most notably since the HTML parser has been added to the trunk. We (my colleague and I) have been using lxml for processing rather large DocBook-type XML docs split across multiple files and in mixed namespaces without any problems, and our experiences with lxml have been nothing but good, so let me first congratulate the developers on the great work that lxml is.

Lately I've looked into using the lxml trunk revision for analyzing some HTML (not well-formed). I've stumbled upon a behavior that I'm not sure is intended. When I feed ill-formed HTML to a parser via

    doc = etree.parse('afile.html', parser=etree.HTMLParser())  # recover=True by default

an exception is raised and the function yields no result. When I use libxml2 directly:

    doc = libxml2.htmlParseFile('afile.html', None)

libxml2 prints out some errors/warnings but I _do_ get a reference to a document, which can be used normally. I've also tried the procedure used in parser.pxi, i.e. htmlCreateMemoryParserCtxt() and htmlCtxtReadDoc(), with the same results.

So I tracked this behavior down to the call to _handleParseResult(pctxt, result, NULL) at the end of the parse* methods in HTMLParser in parser.pxi. Note the 'if ctxt.wellFormed' part:

<!-- snip
cdef xmlDoc* _handleParseResult(xmlParserCtxt* ctxt, xmlDoc* result,
                                char* c_filename) except NULL:
    cdef _ResolverContext context
    if ctxt.wellFormed:
        __GLOBAL_PARSER_CONTEXT._initDocDict(result)
    elif result is not NULL:
        # free broken document
        tree.xmlFreeDoc(result)
        result = NULL
-->

where the document is destroyed if the wellFormed flag of libxml2's context is not set. I checked this by calling libxml2's htmlCtxtReadDoc() directly on that document, and indeed the wellFormed flag turned out to be 0.

Now, shouldn't the HTMLParser also return the document reference in this case if the recover=True flag is specified, since libxml2 apparently does not have problems with that?

I've checked this by adding an 'accept_ill_formed' argument to _handleParseResult. If that flag is set, ctxt.wellFormed is ignored. I also modified the _handleParseResult calls in HTMLParser's parse... methods to pass the accept flag if 'recover' was set in the constructor. This turned out to work just fine and I was able to navigate the document with XPath; even the errors (using '&' in href attributes) were corrected.

So the question is: is the present behavior correct due to something I'm possibly missing? I think it should depend on whether the RECOVER flags have been specified or not.

Best regards,
Dean
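
The decision Dean describes can be modelled in a few lines of plain Python. This is only an illustrative sketch, not the code in parser.pxi: `ParseOutcome` and `handle_parse_result` are hypothetical names, the document is represented by a string instead of an `xmlDoc*`, and raising `ValueError` is an assumption, since the quoted snippet is cut off before the error handling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseOutcome:
    """Hypothetical stand-in for what libxml2 hands back after parsing."""
    well_formed: bool          # mirrors ctxt.wellFormed
    document: Optional[str]    # mirrors the xmlDoc* result (None == NULL)


def handle_parse_result(outcome: ParseOutcome,
                        accept_ill_formed: bool = False) -> str:
    """Return the document, or raise if it must be rejected.

    accept_ill_formed models the extra argument Dean added; in his patch
    it is set when the parser was constructed with recover=True.
    """
    if outcome.document is None:
        # The parser produced nothing at all: nothing to recover.
        raise ValueError("parser returned no document")
    if outcome.well_formed or accept_ill_formed:
        return outcome.document
    # Behaviour before the patch: the broken tree is discarded.
    raise ValueError("document is not well-formed")


recovered = ParseOutcome(well_formed=False, document="<recovered tree>")
print(handle_parse_result(recovered, accept_ill_formed=True))
```

With `accept_ill_formed=False` the same call raises, which corresponds to the exception Dean sees when the tree is freed.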
Hi Dean,

Dean Pavlekovic wrote:
> experiences with lxml have been nothing but good, so let me first congratulate the developers on a great work lxml is.

Thanks! :)

> When I feed an ill-formed HTML to a parser an exception is raised and the function yields no result. When I use libxml2 directly, libxml2 prints out some errors/warnings but I _do_ get a reference to a document, which can normally be used.
>
> So I tracked this behavior down to calling _handleParseResult(pctxt, result, NULL) at the end of parse* methods in HTMLParser in parser.pxi, where the document is destroyed if libxml2's context wellFormed flag is not set. I checked this by calling libxml2 htmlCtxtReadDoc() directly on that document and indeed the wellFormed flag turned out to be 0.
>
> Now, shouldn't the HTMLParser also return the document reference in this case if recover=True flag is specified since libxml apparently does not have problems with that.

It's absolutely reasonable to do that. My guess is that libxml2 will always try to return either NULL or a correct and usable in-memory structure no matter how broken and incomplete the parsed data was. So if it returns anything but NULL, that should be usable.

I changed the trunk to always accept ill-formed results if the recover option is set and no lxml-internal errors were raised. Please try if that helps.

I couldn't come up with a sufficiently short example of broken HTML where this problem occurs, so I couldn't test it. The examples that were tested so far can be found in src/lxml/tests/test_htmlparser.py.

Thanks for the report,
Stefan
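
A short check along the lines Stefan asks for might look like the sketch below. The input is a hypothetical sample: the unescaped '&' in the href is the kind of error Dean mentions, but whether this exact markup triggered the original problem on the trunk of the time is not known from the thread. It assumes an lxml version where recover mode returns the tree.

```python
from lxml import etree

# Hypothetical sample input: an unescaped '&' in an href attribute and an
# unclosed <p>, chosen for illustration only.
BROKEN_HTML = (
    b'<html><body><p>See <a href="page?a=1&b=2">this link</a>'
    b'</body></html>'
)

parser = etree.HTMLParser(recover=True)  # recover=True is the default
root = etree.fromstring(BROKEN_HTML, parser)

# The recovered tree can be navigated with XPath as usual.
hrefs = root.xpath('//a/@href')
print(root.tag, hrefs)

# Problems libxml2 reported while recovering, if any, end up in the
# parser's error log instead of aborting the parse.
for entry in parser.error_log:
    print(entry.line, entry.message)
```

If the parser rejected the document, `etree.fromstring` would raise instead of returning a root element, so reaching the XPath call is itself the signal that recovery worked.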
participants (2)
- Dean Pavlekovic
- Stefan Behnel