lxml.html.fromstring() doesn’t seem to get the doctype right?
Hello,

Following from my previous post ( https://mail.python.org/archives/list/lxml@python.org/thread/NT7GNLORN676BMS... ), I also noticed that reading an x/html file without a doctype produces an incorrect/unexpected doctype. For example:

    import lxml.html

    b = b"""<?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>
    """

parses ok into an element and element tree:

    elm = lxml.html.fromstring(b)
    # <Element html at 0x10fbea530>

but the doctype for that document is, I believe, incorrect:

    root = elm.getroottree()
    root.docinfo.doctype
    # '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'

Considering the xml declaration and the html element’s namespace, I would have expected the derived doctype to be

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

for an xhtml file.

Also, the DocInfo ( https://lxml.de/apidoc/lxml.etree.html#lxml.etree.DocInfo ) doesn’t actually denote whether the original document contained an xml declaration; wouldn’t a flag be useful?
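To illustrate the kind of flag I mean, here is a minimal sketch that checks the raw bytes before they are handed to the parser. The helper name has_xml_declaration is hypothetical and not part of lxml, and the check assumes an ASCII-compatible encoding such as UTF-8 (optionally with a BOM), so it would miss UTF-16 input:

```python
import re

# Hypothetical helper, not an lxml API. Matches an XML declaration at the
# very start of the data, optionally preceded by a UTF-8 byte order mark.
# Assumption: the input uses an ASCII-compatible encoding.
_XML_DECLARATION = re.compile(rb"\A(?:\xef\xbb\xbf)?<\?xml\s[^>]*\?>")


def has_xml_declaration(data: bytes) -> bool:
    """Return True if the raw document starts with an XML declaration."""
    return _XML_DECLARATION.match(data) is not None


with_decl = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
)
without_decl = b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'

print(has_xml_declaration(with_decl))     # True
print(has_xml_declaration(without_decl))  # False
```

This only inspects the input; it does not change what the parser stores in DocInfo.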
I ask because ideally round-tripping a document should produce that same document, but that is currently not the case:

    b = b"""<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>"""
    elm = lxml.html.fromstring(b)
    # <Element html at 0x10fbea670>

    lxml.html.tostring(elm)
    # b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'

    lxml.html.tostring(elm.getroottree())
    # b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<!--?xml version="1.0" encoding="UTF-8"?--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'

    lxml.html.tostring(elm.getroottree(), method="xml")
    # b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<!--?xml version="1.0" encoding="UTF-8"?--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"/>'

Cheers,
Jens
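P.S. For comparison, a minimal sketch of the same round trip through the XML parser in lxml.etree rather than lxml.html. It assumes the input is well-formed XML, and it yields plain etree elements with namespaced tags instead of lxml.html elements, so it is not a drop-in replacement. The name SOURCE is only for illustration:

```python
from lxml import etree

SOURCE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
)

# Parse with the XML parser instead of the HTML parser used by lxml.html.
root = etree.fromstring(SOURCE)
tree = root.getroottree()

# The input had no DOCTYPE, and the XML parser does not add one.
print(repr(tree.docinfo.doctype))

# Version and encoding as recorded in DocInfo for this document.
print(tree.docinfo.xml_version, tree.docinfo.encoding)

# Explicitly ask the serializer to emit an XML declaration again.
out = etree.tostring(tree, xml_declaration=True, encoding="UTF-8")
print(out)
```

The serialized bytes are not guaranteed to be identical to the input (quoting in the declaration and the form of the empty element may differ), but no HTML 4.0 doctype appears and the declaration is not turned into a comment.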
Jens Tröger