Is there a way to stop the HTMLParser rewriting invalid HTML tags?
Hello, I have a question regarding the HTMLParser in lxml. Is there a way to make it less 'strict'? For example, as currently configured, it rewrites <h1><p>Hello</p></h1> to <h1></h1><p>Hello</p>. The HTML spec does not allow <p> tags inside <h1> tags. However, for my specific use case, I would like to keep the HTML as close to the original as possible, even if it is invalid. The code I am currently using is as follows:
    from lxml import etree

    h = "<h1><p>Hello</p></h1>"
    etree_parser = etree.HTMLParser()
    tree = etree.fromstring(h, parser=etree_parser)
    etree.tostring(tree, method='html')
I am using lxml 3.5. Is there a way to disable this reordering of the tags so that the <p> remains a child of the <h1>? Thanks!
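To make the desired nesting concrete, here is a minimal sketch that records where each tag opens relative to its ancestors. It uses the standard library's html.parser rather than lxml, purely as a comparison point: that module reports tags in source order and does not build or repair a tree. The names NestingRecorder and tag_paths are hypothetical helpers introduced for this illustration, not part of lxml or of the original code.

```python
from html.parser import HTMLParser


class NestingRecorder(HTMLParser):
    """Record each start tag together with the tags open at that point."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open elements, outermost first
        self.paths = []      # e.g. "h1/p" for a <p> opened inside an <h1>

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.paths.append("/".join(self.open_tags))

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


def tag_paths(markup):
    """Return the nesting path of every start tag, in source order."""
    recorder = NestingRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.paths


# Nesting as written in the input: the <p> opens inside the <h1>.
print(tag_paths("<h1><p>Hello</p></h1>"))   # ['h1', 'h1/p']
# Nesting after the rewrite described above: the <p> is a sibling.
print(tag_paths("<h1></h1><p>Hello</p>"))   # ['h1', 'p']
```

The first result is the structure I would like to keep; the second is what the rewritten markup looks like. This only describes the nesting and does not change how lxml's HTMLParser behaves.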
Austin Platt