tostring() returning possibly invalid XML. Is this a bug?

In more recent versions of lxml, the tostring() method can return extra text after the closing tag of the node I've passed to it. So instead of returning

b'<form action="action1">\n</form>\n'

it returns

b'<form action="action1">\n</form>\n</body>\n</html>\n'

Here's a (Python 3) script along with two outputs, one from a machine running lxml 4.6.5 and one running 4.8.0. NOTE: the output only changes if the DOCTYPE line is left in the "html" variable.

import sys
from lxml import etree

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<body>
<form action="action1">
</form>
</body>
</html>
"""

parser = etree.XMLParser()
doc = etree.fromstring(html, parser=parser)
nodeList = doc.xpath("//form")
print(etree.tostring(nodeList[0]))

This is the output I would expect to see:

Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree          : (4, 6, 5, 0)
libxml used         : (2, 9, 10)
libxml compiled     : (2, 9, 10)
libxslt used        : (1, 1, 34)
libxslt compiled    : (1, 1, 34)
b'<form action="action1">\n</form>\n'
#<------ Notice how the tostring() has returned the opening and closing <form> node (as I expected)

This is the output I get when I upgrade:

Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree          : (4, 8, 0, 0)
libxml used         : (2, 9, 12)
libxml compiled     : (2, 9, 12)
libxslt used        : (1, 1, 34)
libxslt compiled    : (1, 1, 34)
b'<form action="action1">\n</form>\n</body>\n</html>\n'
#<-------- Notice how the tostring() has returned extra text after the closing </form> tag

Is this a bug? Or is this expected behaviour if the DOCTYPE is defined in the html passed to etree.fromstring()? i.e. is there a valid reason why tostring() might return an invalid XML byte string?

I've seen the same behaviour in lxml 4.7.1, but I've not tried 4.9.0 as it's not in my repo yet.

Any help appreciated! If this is a deliberate change, I've got quite a lot of legacy code that will need updating to cope.
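For anyone wanting to detect the problem in their own code, here is a minimal sketch of a check that a serialized fragment parses back as a single element. It is only an illustration: the helper name is_well_formed is hypothetical (not part of lxml), and the two byte strings are the outputs quoted in the report above.

```python
from lxml import etree


def is_well_formed(fragment: bytes) -> bool:
    """Return True if `fragment` parses as one XML element (hypothetical helper)."""
    try:
        # The default parser does not recover from errors, so any
        # content after the root element raises XMLSyntaxError.
        etree.fromstring(fragment)
    except etree.XMLSyntaxError:
        return False
    return True


# Output reported with lxml 4.6.5 / libxml2 2.9.10
expected = b'<form action="action1">\n</form>\n'
# Output reported with lxml 4.8.0 / libxml2 2.9.12
unexpected = b'<form action="action1">\n</form>\n</body>\n</html>\n'

print(is_well_formed(expected))
print(is_well_formed(unexpected))
```

A check like this could sit in a test suite so that an lxml or libxml2 upgrade that changes tostring() output is caught early.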

On 7 Jun 2022, at 16:56, brian.bird@trustpayments.com wrote:
In more recent versions of lxml the tostring() method can return extra text after the closing tag of the node I've passed to it. So instead of returning
b'<form action="action1">\n</form>\n'
it returns
b'<form action="action1">\n</form>\n</body>\n</html>\n'
This looks **a lot** like this https://mail.python.org/archives/list/lxml@python.org/thread/LCTOSIIWGGALAMS...

Can you update your version of libxml2?

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

Hi Brian & Charlie,

I'm not the OP, but FYI, I can see the same issue (on an Intel Mac):

aid@orac tmp % ./tail.py
Python              : sys.version_info(major=3, minor=9, micro=13, releaselevel='final', serial=0)
lxml.etree          : (4, 9, 0, 0)
libxml used         : (2, 9, 14)
libxml compiled     : (2, 9, 14)
libxslt used        : (1, 1, 35)
libxslt compiled    : (1, 1, 35)
b'<form action="action1">\n</form>\n</body>\n</html>\n'

You can see my machine is using libxml2 2.9.14, which is a pity, as in the thread you linked to it looked like the issue would have been resolved in that version...

However, I found that if you update the call to etree.tostring() to use method='html', the trailing body and html closing tags are no longer shown, i.e.:

print(etree.tostring(nodeList[0], method='html'))

With that update made, the script outputs the desired result:

aid@orac tmp % python3 -i tail.py
Python              : sys.version_info(major=3, minor=9, micro=13, releaselevel='final', serial=0)
lxml.etree          : (4, 9, 0, 0)
libxml used         : (2, 9, 14)
libxml compiled     : (2, 9, 14)
libxslt used        : (1, 1, 35)
libxslt compiled    : (1, 1, 35)
b'<form action="action1">\n</form>\n'

I've no idea why this behaviour seems to have changed...

Kind regards,
aid
On 7 Jun 2022, at 17:02, Charlie Clark <charlie.clark@clark-consulting.eu> wrote:
This looks a lot like this https://mail.python.org/archives/list/lxml@python.org/thread/LCTOSIIWGGALAMS... Can you update your version of libxml2?
Charlie
participants (3)
- Adrian Bool
- brian.bird@trustpayments.com
- Charlie Clark