I’m using the following code to extract the DOCTYPE with Python 2.7 and lxml 4.1.1:
from lxml import etree
from StringIO import StringIO

if __name__ == '__main__':
    # Parse a document whose DOCTYPE names a prefixed root element
    doc = etree.parse(StringIO('''<?xml version="1.0"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j = "http://jakarta.apache.org/log4j/" debug="false">
<a>tasty</a>
</log4j:configuration>'''))
    # Print the DOCTYPE string as reported by lxml
    print "Type: {}\n".format(doc.docinfo.doctype)

But it returns: Type: <!DOCTYPE configuration SYSTEM "log4j.dtd">
And not, as I expected: Type: <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">

Is this a bug in lxml?
Is there a workaround to get the output I expect?
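
The only fallback I can think of so far is to skip docinfo and read the declaration straight from the raw XML text. Below is a rough sketch of that idea, assuming the text is still available as a string and the DOCTYPE has no internal subset. The helper name extract_doctype and the regular expression are my own, not part of lxml, and the pattern would also fail on a system identifier containing ">" or on a DOCTYPE that only appears inside a comment.

```python
import re

# Hypothetical helper, not part of lxml. The pattern matches a DOCTYPE
# declaration that has no internal subset (no "[...]" block).
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+[^>\[]+>')


def extract_doctype(xml_text):
    """Return the DOCTYPE declaration exactly as written, or None if absent."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None


xml_text = '''<?xml version="1.0"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
<a>tasty</a>
</log4j:configuration>'''

# Keeps the prefixed root name because nothing is re-serialized
doctype = extract_doctype(xml_text)
```

This keeps the prefix only because it copies the text verbatim, so I would still prefer a way to get the same result from lxml itself.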


=:-) Kim Grønborg Nielsen
M kgn+lxml@network-it.dk