Is there a way to stop the HTMLParser rewriting invalid HTML tags? - lxml - The Python XML Toolkit

25 Nov 2015

Hello,

I have a question regarding the HTMLParser in lxml. Is there a way to make
it less 'strict'? For example, as I have it currently configured, it will
re-write

<h1><p>Hello</p></h1>

to

<h1></h1><p>Hello</p>

The HTML spec does not allow <p> tags to be contained within <h1> tags.
However, for my specific use case, I would like to leave the HTML
as close to the original as possible, even if it is invalid.

The code I am using presently is as follows:
...
...
...
h = "<h1><p>Hello</p></h1>"
etree_parser = etree.HTMLParser()
tree = etree.fromstring(h, parser=etree_parser)
etree.tostring(tree, method='html')

The version of lxml I am using is 3.5.

So I was wondering if there was a way to disable this re-ordering of the
tags so that the <p> would remain as a child of the <h1>?
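For comparison, the tree shape being asked for can be sketched with the Python standard library's html.parser instead of lxml. That module only reports tags in the order they appear and does not apply HTML nesting rules, so the <p> stays inside the <h1>. This is a swapped-in technique, not an lxml option, and it does not show whether lxml's HTMLParser can be configured this way. The names NaiveTreeBuilder, VOID_TAGS and parse_preserving_nesting are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NaiveTreeBuilder(HTMLParser):
    """Hypothetical helper: builds an ElementTree that mirrors the source
    nesting exactly, without any HTML content-model corrections."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")  # synthetic container element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. <input disabled>); store "" then.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_preserving_nesting(html):
    """Parse an HTML fragment and return the synthetic root element."""
    builder = NaiveTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse_preserving_nesting("<h1><p>Hello</p></h1>")
print(ET.tostring(root[0], encoding="unicode", method="html"))
# <h1><p>Hello</p></h1>
```

The trade-off is that such a tree is a plain xml.etree.ElementTree structure rather than an lxml one, and no error recovery beyond ignoring stray end tags is attempted.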

Thanks!

Austin Platt