etree.tostring returns all content after element in XHTML 1.0 Transitional?
I recently noticed what seems like an odd behavior of etree.tostring, and I'm trying to figure out whether this is a bug or some subtlety of the API or of X(HT)ML processing that I'm not aware of.

Given the following document (1_1.xhtml):

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Title</title>
      </head>
      <body>
        <p>One</p>
        <p>Two</p>
        <p>Three</p>
      </body>
    </html>

and the following script (test-tostring.py):

    import sys
    import lxml
    import lxml.html

    doc = lxml.etree.parse(sys.argv[1], parser=lxml.html.XHTMLParser())
    body = doc.find(".//{*}body")
    for elt in body:
        print(lxml.etree.tostring(elt))

running "python3 test-tostring.py 1_1.xhtml" produces this output:

    b'<p xmlns="http://www.w3.org/1999/xhtml">One</p>\n '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Two</p>\n '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Three</p>\n '

So far, so good. However, if I copy 1_1.xhtml to 1_0-transitional.xhtml and change its doctype to

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

then the output of "python3 test-tostring.py 1_0-transitional.xhtml" is:

    b'<p xmlns="http://www.w3.org/1999/xhtml">One</p>\n <p>Two</p>\n <p>Three</p>\n </body>\n</html>\n '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Two</p>\n <p>Three</p>\n </body>\n</html>\n '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Three</p>\n </body>\n</html>\n '

In other words, the text of each element as serialized with tostring() includes the entire rest of the document after it, not just its own subtree! This is, at the very least, not what I was expecting to see.

Both XHTML documents pass the checks on validator.w3.org, so I don't think it's a matter of bad formatting, and I haven't been able to find anything in the lxml documentation or recent changelogs that would explain it.
Using an lxml.etree.XMLParser as the parser produces the same results. Setting with_tail=False removes the trailing whitespace from each line, but not the content after the element in the 1.0 Transitional doc.

Any idea what might be causing this?

Version information:

    Python           : sys.version_info(major=3, minor=11, micro=2, releaselevel='final', serial=0)
    lxml.etree       : (4, 9, 2, 0)
    libxml used      : (2, 9, 14)
    libxml compiled  : (2, 9, 14)
    libxslt used     : (1, 1, 35)
    libxslt compiled : (1, 1, 35)
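
For comparison, here is a minimal sketch of the per-element output I would expect for the 1.0 Transitional document, using the standard library's xml.etree.ElementTree instead of lxml. It is only a baseline from a different implementation, not a statement about how lxml or libxml2 serialize internally. The names DOC, XHTML_NS, and serialize_body_children are made up for this illustration, the document is inlined so the snippet runs without any files, and the default_namespace argument of tostring needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same content as 1_0-transitional.xhtml, inlined so the sketch is self-contained.
DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Title</title>
  </head>
  <body>
    <p>One</p>
    <p>Two</p>
    <p>Three</p>
  </body>
</html>
"""


def serialize_body_children(data: bytes) -> list[bytes]:
    """Serialize each direct child of <body> on its own (hypothetical helper)."""
    root = ET.fromstring(data)
    body = root.find(f"{{{XHTML_NS}}}body")
    chunks = []
    for elt in body:
        # ElementTree also appends the element's tail text, so strip the
        # trailing whitespace to leave only the element's own subtree.
        chunks.append(ET.tostring(elt, default_namespace=XHTML_NS).strip())
    return chunks


for chunk in serialize_body_children(DOC):
    print(chunk)
```

With the standard library, each chunk stops at the element's own closing tag, which is the behavior I was expecting from lxml for both doctypes.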
participants (1): Jim Wisniewski