Hello,
I have a question about the HTMLParser in lxml. Is there a way to make
it less strict? For example, with the default configuration shown below,
it rewrites
<h1><p>Hello</p></h1>
to
<h1></h1><p>Hello</p>
I understand that the HTML spec does not allow <p> tags inside <h1> tags.
However, for my specific use case, I would like to keep the output as
close to the original HTML as possible, even if it is invalid.
The code I am using presently is as follows:
>>> from lxml import etree
>>> h = "<h1><p>Hello</p></h1>"
>>> etree_parser = etree.HTMLParser()
>>> tree = etree.fromstring(h, parser=etree_parser)
>>> etree.tostring(tree, method='html')
The version of lxml I am using is 3.5.
Is there a way to disable this re-ordering of the tags so that the <p>
remains a child of the <h1>?
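To illustrate the result I am after, here is a minimal sketch that does not use lxml at all. It relies on the standard library's html.parser, which reports tags in the order they appear without re-ordering them, and builds a plain xml.etree.ElementTree tree from those events. The names NestingPreservingBuilder, parse_preserving_nesting and the artificial "root" wrapper element are my own, purely for illustration, and the end-tag handling is a simplification rather than spec-compliant HTML parsing.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NestingPreservingBuilder(HTMLParser):
    """Hypothetical helper: mirrors the source nesting, valid or not."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")  # artificial wrapper, not in the input
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_preserving_nesting(html_text):
    """Return the wrapper element containing the parsed fragment."""
    builder = NestingPreservingBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


root = parse_preserving_nesting("<h1><p>Hello</p></h1>")
h1 = root[0]
print(h1.tag, [child.tag for child in h1])
print(ET.tostring(h1, encoding="unicode", method="html"))
```

With this sketch the <p> stays inside the <h1>, which is the structure I would like to get from lxml's HTMLParser as well.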
Thanks!