docinfo.doctype don't return the original doctype
I’m using the following code to extract the DOCTYPE with Python 2.7 and lxml 4.1.1:

    from lxml import etree
    from StringIO import StringIO

    if __name__ == '__main__':
        doc = etree.parse(StringIO('''<?xml version="1.0"?>
    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
    <log4j:configuration xmlns:log4j = "http://jakarta.apache.org/log4j/" debug="false">
    <a>tasty</a>
    </log4j:configuration>'''))
        print "Type: {}\n".format(doc.docinfo.doctype)

But it returns:

    Type: <!DOCTYPE configuration SYSTEM "log4j.dtd">

And not, as I expected:

    Type: <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

Is it a bug in lxml? Is there a workaround for getting what I’m expecting?

=:-)
Kim Grønborg Nielsen
M kgn+lxml@network-it.dk
On 27 December 2017 at 19:35:20 CET, "Kim Grønborg Nielsen" wrote:
But it returns:

    Type: <!DOCTYPE configuration SYSTEM "log4j.dtd">

And not, as I expected:

    Type: <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
The DOCTYPE looks correct to me. DTDs are not namespace aware and do not know or care about prefixes.

Stefan
But JBoss does care, and fails to load the XML document when the DOCTYPE is changed from <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd"> to <!DOCTYPE configuration SYSTEM "log4j.dtd">, because "configuration" no longer matches the log4j:configuration root.

If I make the modification to the XML document with Perl XML::LibXML, the DOCTYPE isn’t changed. But because we are using Ansible for application deployment, we need to use lxml. Unless I do some ugly stuff.

=:-)
Kim Grønborg Nielsen
E kgn+lxml@network-it.dk
Begin forwarded message:
From: Stefan Behnel <stefan_ml@behnel.de>
Subject: Re: [lxml] docinfo.doctype don't return the original doctype
Date: 28 December 2017 at 08.35.43 CET
To: lxml@lxml.de
The DOCTYPE looks correct to me. DTDs are not namespace aware and do not know or care about prefixes.
Stefan
Kim Grønborg Nielsen wrote on 28.12.2017 at 09:13:
From: Stefan Behnel
The DOCTYPE looks correct to me. DTDs are not namespace aware and do not know or care about prefixes.
But JBoss does care, and fails to load the XML document when the DOCTYPE is changed from <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd"> to <!DOCTYPE configuration SYSTEM "log4j.dtd">, because "configuration" no longer matches the log4j:configuration root.
If I make the modification to the XML document with Perl XML::LibXML, the DOCTYPE isn’t changed. But because we are using Ansible for application deployment, we need to use lxml. Unless I do some ugly stuff.
It's not clear to me what "the modification" refers to here, i.e. what kind of changes you are making to the document that involve "docinfo.doctype".

Anyway, I wonder what the "right" behaviour is here. lxml only writes the "internal subset" (i.e. any DTD content inside the document) if the root tag name matches that of the DTD content, which it doesn't in this case. The root tag name is "configuration", whereas the DOCTYPE refers to "log4j:configuration". The reason is that changing the root tag name or serialising subtrees shouldn't write an orphan DOCTYPE.

But since DTDs are not namespace aware, it could be argued that the DTD root name actually includes the prefix, however stupid that is from the point of view of XML namespaces.

Is there any consensus on how other tools (more than just one) commonly behave w.r.t. the DTD root name?

Stefan
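For readers looking for the workaround asked about earlier in the thread, here is a minimal sketch. It is not something proposed by either participant. It assumes Python 3 (the thread uses Python 2.7), and the helper names `original_doctype` and `serialize_keeping_doctype` are hypothetical. The idea is to copy the DOCTYPE verbatim from the raw document text instead of asking `docinfo.doctype`, and to hand that string back to lxml through the `doctype` keyword argument of `etree.tostring()` when serialising. The regular expression is deliberately simple: it assumes a DOCTYPE without an internal subset and without a `>` inside the quoted identifiers.

```python
import re

# Matches a DOCTYPE declaration that has no internal subset, e.g.
#   <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
# Assumption: no '>' appears inside the quoted public/system identifiers.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+[^\[>]+>')


def original_doctype(xml_text):
    """Return the DOCTYPE declaration exactly as written, or None."""
    match = _DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None


def serialize_keeping_doctype(xml_text):
    """Parse with lxml and serialise the root with the original DOCTYPE.

    lxml is imported lazily so that original_doctype() stays usable
    without it.
    """
    from lxml import etree

    root = etree.fromstring(xml_text.encode("utf-8"))
    # ... modify the tree here ...
    # The doctype string is written as-is in front of the root element.
    return etree.tostring(root, doctype=original_doctype(xml_text))
```

Because the `doctype` string is emitted unchanged, the prefixed name `log4j:configuration` survives the round trip. If the document carries an internal subset, `original_doctype()` returns `None` and this sketch would need to be extended.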