[lxml-dev] HTMLParser behavior - ill formed HTML

1 May 2006

Hi all,

My name is Dean and I've recently joined lxml-dev to follow how
things are going, most notably since the HTML parser was added to
the trunk. We (my colleague and I) have been using lxml for
processing rather large DocBook-type XML documents split across
multiple files and in mixed namespaces without any problems. Our
experience with lxml has been nothing but good, so let me first
congratulate the developers on the great work that lxml is.

Lately I've looked into using the lxml trunk revision for analyzing
some HTML (not well-formed). I've stumbled upon a behavior that
I'm not sure is intended.

When I feed ill-formed HTML to the parser via
...
doc=etree.parse('afile.html', parser=etree.HTMLParser()) #recover=True by default
an exception is raised and the function yields no result.
When I use libxml2 directly:
...
doc=libxml2.htmlParseFile('afile.html', None)
libxml2 prints out some errors/warnings but I _do_ get a reference to
a document, which can be used normally. I've also tried the
procedure used in parser.pxi, i.e. htmlCreateMemoryParserCtxt() and
htmlCtxtReadDoc(), with the same results.

So I tracked this behavior down to the call to _handleParseResult(pctxt,
result, NULL) at the end of the parse* methods of HTMLParser in
parser.pxi; note the 'if ctxt.wellFormed' part:

<!-- snip
cdef xmlDoc* _handleParseResult(xmlParserCtxt* ctxt, xmlDoc* result,
                                char* c_filename) except NULL:
    cdef _ResolverContext context
    if ctxt.wellFormed:
        __GLOBAL_PARSER_CONTEXT._initDocDict(result)
    elif result is not NULL:
        # free broken document
        tree.xmlFreeDoc(result)
        result = NULL
-->
where the document is destroyed if libxml2's context wellFormed flag
is not set. I checked this by calling libxml2 htmlCtxtReadDoc()
directly on that document and indeed the wellFormed flag turned out to
be 0.

Now, shouldn't the HTMLParser also return the document reference in
this case if the recover=True flag is specified, since libxml2
apparently has no problem with that?
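
To make the expectation concrete, here is a minimal sketch of the
behavior being asked for. It assumes an lxml build in which the
recovering HTMLParser hands back the tree (released versions of lxml
behave this way). BROKEN_HTML and parse_recovering are hypothetical
names made up for this illustration, and the sample markup is invented.

```python
try:
    from lxml import etree
except ImportError:  # lxml not installed; the sketch then does nothing
    etree = None

# Invented sample: an unclosed <p> and a bare '&' in an href attribute.
BROKEN_HTML = b'<p>unclosed paragraph<a href="page?a=1&b=2">link</a>'


def parse_recovering(data):
    """Parse HTML bytes with a recovering parser.

    Returns the root element, or None if lxml is unavailable.
    """
    if etree is None:
        return None
    parser = etree.HTMLParser(recover=True)
    return etree.fromstring(data, parser)


root = parse_recovering(BROKEN_HTML)
if root is not None:
    # The recovered tree can be queried like any other document.
    print(root.tag, root.xpath('//a/@href'))
```

The point of the sketch is only that a tree comes back and can be
queried despite the broken input.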

I've checked this by adding an 'accept_ill_formed' argument to
_handleParseResult. If that flag is set, ctxt.wellFormed
is ignored. I also modified the _handleParseResult calls in
HTMLParser's parse... methods to pass the accept flag if 'recover' was
set in the constructor.
This turned out to work just fine: I was able to navigate the document
with XPath, and even the errors (using '&' in href attributes) were
corrected.
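
The decision logic of that change can be modelled in plain Python. This
is only a sketch of the idea, not the actual Pyrex code in parser.pxi:
FakeParserContext and handle_parse_result are hypothetical stand-ins,
and returning None stands in for freeing the document.

```python
from dataclasses import dataclass


@dataclass
class FakeParserContext:
    """Hypothetical stand-in for the libxml2 parser context."""
    well_formed: bool


def handle_parse_result(ctxt, result, accept_ill_formed=False):
    """Return the document to keep, or None if it should be discarded.

    Models the proposed 'accept_ill_formed' argument: when it is set,
    the wellFormed flag of the context is ignored.
    """
    if ctxt.well_formed or accept_ill_formed:
        return result
    # Stand-in for freeing the broken document in the real code.
    return None


# An HTMLParser created with recover=True would pass accept_ill_formed=True.
broken = FakeParserContext(well_formed=False)
print(handle_parse_result(broken, "doc"))
print(handle_parse_result(broken, "doc", accept_ill_formed=True))
```

With the flag unset, a document that is not well-formed is dropped as
before; with the flag set, it is passed through to the caller.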

So the question is: is the present behavior correct because of
something I'm missing? I think it should depend on
whether the RECOVER flag has been specified or not.

Best regards,
Dean

Dean Pavlekovic