docinfo.doctype doesn't return the original doctype

I'm using the following code to extract the DOCTYPE with Python 2.7 and lxml 4.1.1:

    from lxml import etree
    from StringIO import StringIO

    if __name__ == '__main__':
        doc = etree.parse(StringIO('''<?xml version="1.0"?>
    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
    <log4j:configuration xmlns:log4j = "http://jakarta.apache.org/log4j/" debug="false">
      <a>tasty</a>
    </log4j:configuration>'''))
        print "Type: {}\n".format(doc.docinfo.doctype)

But it returns:

    Type: <!DOCTYPE configuration SYSTEM "log4j.dtd">

And not, as I expected:

    Type: <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

Is this a bug in lxml? Is there a workaround for getting what I'm expecting?

=:-)
Kim Grønborg Nielsen
M kgn+lxml@network-it.dk

But JBoss does care, and fails to load the XML document when the DOCTYPE is changed from

    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

to

    <!DOCTYPE configuration SYSTEM "log4j.dtd">

because "configuration" no longer matches the log4j:configuration root element.

If I make the modification to the XML document with Perl's XML::LibXML, the DOCTYPE isn't changed. But because we are using Ansible for application deployment, we need to use lxml, unless I do some ugly stuff.

=:-)
Kim Grønborg Nielsen
E kgn+lxml@network-it.dk

Kim Grønborg Nielsen wrote on 28.12.2017 at 09:13:

It's not clear to me what "the modification" refers to here, i.e. what kind of changes you are making to the document that involve "docinfo.doctype".

Anyway, I wonder what the "right" behaviour is here. lxml only writes the "internal subset" (i.e. any DTD content inside of the document) if the root tag name matches that of the DTD content, which it doesn't in this case. The root tag name is "configuration", whereas the DOCTYPE refers to "log4j:configuration". The reason is that changing the root tag name or serialising subtrees shouldn't write an orphan DOCTYPE.

But since DTDs are not namespace aware, it could be argued that the DTD root name actually includes the prefix, however stupid that is from the point of view of XML namespaces.

Is there any consensus on how other tools (more than just one) commonly behave w.r.t. the DTD root name?

Stefan
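
For comparison with the question about other tools, here is a minimal sketch of how the DOCTYPE name can be read exactly as written, prefix included. It is not taken from the thread: it assumes Python 3 and uses only the standard library's expat bindings rather than lxml, and the helper names read_doctype and format_doctype are made up for illustration.

```python
from xml.parsers import expat


def read_doctype(xml_text):
    """Return (name, system_id, public_id) of the DOCTYPE, or None if absent.

    Hypothetical helper: expat reports the DOCTYPE name as it appears in
    the source text, without any namespace processing.
    """
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id, public_id))

    parser = expat.ParserCreate()  # no namespace separator: prefixes stay intact
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)   # True marks this as the final chunk
    return found[0] if found else None


def format_doctype(name, system_id, public_id):
    """Rebuild a DOCTYPE declaration string from its parts (hypothetical helper)."""
    if public_id:
        return '<!DOCTYPE {} PUBLIC "{}" "{}">'.format(name, public_id, system_id)
    if system_id:
        return '<!DOCTYPE {} SYSTEM "{}">'.format(name, system_id)
    return '<!DOCTYPE {}>'.format(name)


SAMPLE = '''<?xml version="1.0"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
  <a>tasty</a>
</log4j:configuration>'''

doctype = format_doctype(*read_doctype(SAMPLE))
print(doctype)
```

A string recovered this way could then be handed to lxml's etree.tostring() through its doctype keyword argument, which writes the given text before the serialised tree. Whether that fits into the Ansible-based workflow described above is an open question.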
