Surprising behavior of lxml.html.tostring()
Hello,

I’m a little puzzled by the behavior of the lxml.html.tostring() function, and would appreciate it if somebody could shed some light on this. The test code is as follows: first we parse a small HTML document (derived from an actual real-world document!)

    import lxml.etree
    import lxml.html

    s = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body>
    </body>
    </html>
    """

This reads ok as XML:

    lxml.etree.XML(s.encode())  # <Element {http://www.w3.org/1999/xhtml}html at 0x10e837d00>
    lxml.etree.fromstring(s.encode())  # <Element {http://www.w3.org/1999/xhtml}html at 0x10e848980>

and HTML:

    elm = lxml.html.fromstring(s.encode())  # <Element html at 0x10e7d00f0>
    root = elm.getroottree()
    root.docinfo.doctype  # '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'

Serializing this back to HTML creates an unexpected string, though:

    lxml.html.tostring(elm.getroottree(), method="xml", encoding="unicode")

Produces for lxml v5.3.0

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <?xml version="1.0" encoding="UTF-8"??><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body>
    </body>
    </html>

and for lxml v6.0.2

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <!--?xml version="1.0" encoding="UTF-8"?--><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body>
    </body>
    </html>

The latter parses ok with both lxml.etree.XML() and lxml.html.fromstring(), whereas the former fails to parse as an XML file using lxml.etree.XML(). So it seems that *some* behavior was changed/fixed, but I was unable to find that mentioned in the changelog.

Both serialized documents, though, differ from the original in that the <!DOCTYPE> and <?xml?> elements are swapped, and the XML declaration is garbled (v5.3.0) or commented out (v6.0.2). Why? Also, is there a way to generate both elements in the original order?

Many thanks!
Jens

Hi,

Good question! The answer can be found in libxml2's HTMLparser.c, in the function htmlParseDocument, lines 4430-4452 (latest commit hash currently 54824911). (lxml wraps libxml2, a C library.)

As you can see, libxml2 expects a doctype declaration to always begin with <!DOCTYPE. In that case, libxml2 calls htmlParseDocTypeDecl(ctxt) and the doctype is parsed. However, in lines 4450-4452, you can see that an XML or XML-like declaration beginning with <? leads to a "bogus" comment being recorded, essentially a malformed comment.

When libxml2 finishes parsing, it adds a doctype in. In libxml2, the doctype is stored in ctxt->myDoc->internalSubset (not sure why). In SAX2.c, the function xmlSAX2EndDocument runs when an HTML document finishes parsing. You can see in lines 869-874 that intSubset on the document is set if it is originally NULL. And the hard-coded doctype matches what you see in your testing.

As for the change between lxml versions, I'm not sure where it comes from, but there's a commit in libxml2 from seven months ago that modifies this code: commit b424bae7.

To answer your question about fixing this, I doubt there's a way without changing those lines of code in libxml2.

Links to the code:
https://gitlab.gnome.org/GNOME/libxml2/-/blob/54824911cd8a5f6918d2ca74cfd865...
https://gitlab.gnome.org/GNOME/libxml2/-/blob/54824911cd8a5f6918d2ca74cfd865...

Link to the commit:
https://gitlab.gnome.org/GNOME/libxml2/-/commit/b424bae705180a2d6df2db1767e3...

Best,
Abe
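
A possible workaround on the Python side, offered only as a sketch: since the HTML parser turns a leading <?xml ...?> into a bogus comment, one option is to cut the declaration off before parsing and put it back in front of the serialized output. This does not change the libxml2 behavior described above; it only sidesteps it. The helper names split_xml_declaration and roundtrip are hypothetical (they are not part of lxml), and the sketch assumes the declaration sits at the very start of the input and that the doctype survives the parse, as the docinfo output in the question suggests.

```python
import re

# An XML declaration at the very start of the text, plus the whitespace after it.
_XML_DECL = re.compile(r"\A\s*(<\?xml\b[^>]*\?>)\s*")


def split_xml_declaration(text):
    """Return (declaration, remainder); declaration is "" if there is none."""
    match = _XML_DECL.match(text)
    if match is None:
        return "", text
    return match.group(1), text[match.end():]


def roundtrip(text):
    """Parse with lxml.html and serialize again, keeping the declaration first.

    Hypothetical helper: it relies on lxml.html.fromstring() and
    lxml.html.tostring() exactly as they are used in the question.
    """
    import lxml.html  # imported lazily so the helper above works without lxml

    declaration, body = split_xml_declaration(text)
    # The HTML parser never sees "<?xml ...?>", so there is nothing for it
    # to record as a bogus comment.
    root = lxml.html.fromstring(body)
    serialized = lxml.html.tostring(
        root.getroottree(), method="xml", encoding="unicode"
    )
    return f"{declaration}\n{serialized}" if declaration else serialized
```

With the document from the question, roundtrip(s) would be called on the str itself rather than on s.encode(). If the input is known to be well-formed XHTML, another route worth trying is to stay with the XML parser and serialize the tree with lxml.etree.tostring(tree, xml_declaration=True, encoding="UTF-8"), which writes the declaration itself.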

Participants (2): abepolk@gmail.com, Jens Tröger