
Sidnei da Silva wrote:
I am looking for a way to output internal entities that have been parsed from the original document when writing out a tree, but apparently this is not exposed in any attribute.
Here's an example:
{{{
import lxml.etree

document = """<?xml version="1.0"?>
<!DOCTYPE application [
<!ENTITY nbsp "\ ">
]>
<application>
</application>
"""

tree = lxml.etree.fromstring(document)
print(tree.getroottree().docinfo.doctype)
}}}
I would expect this to output:
{{{
<!DOCTYPE application [
<!ENTITY nbsp "\ ">
]>
}}}
But instead it gives me:
{{{ <!DOCTYPE application> }}}
Is this a bug, or am I not looking in the right place?
What you are looking for is the internal subset of the document, which is not (really) part of the DOCTYPE itself. It's available through the "docinfo.internalDTD" property. However, lxml.etree doesn't expose the content of the DTD, so this is currently only usable for validation (i.e. not very helpful in your case).
What you could try is to parse the document without resolving the entities, then traverse the Entity elements and collect their names in a set. That will not give you the resolved entity values, though...
I think it would be nice if tostring() could serialise DTDs, but I doubt that there are many use cases for that. In your case, you'd then have to parse the DTD yourself, which you could also do by clearing the root node and serialising the document to unicode.
Stefan
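A rough sketch of the "parse without resolving entities, then collect the Entity names" idea might look like the following. The sample document, the `appname` entity and the helper name `collect_entity_names` are made up for illustration. It assumes an lxml version in which `XMLParser(resolve_entities=False)` leaves entity references in the tree as Entity nodes with a `name` attribute. As noted above, this only yields the names of entities that are actually referenced, not their values.

```python
from lxml import etree

# Hypothetical sample document: both entities and the element content
# are invented for this sketch.
document = b"""<?xml version="1.0"?>
<!DOCTYPE application [
<!ENTITY nbsp "&#160;">
<!ENTITY appname "Demo">
]>
<application>
  <title>&appname;</title>
  <body>a&nbsp;b&nbsp;c</body>
</application>
"""


def collect_entity_names(xml_bytes):
    """Return the set of entity names referenced in the document body."""
    # Keep entity references as Entity nodes instead of expanding them.
    parser = etree.XMLParser(resolve_entities=False)
    root = etree.fromstring(xml_bytes, parser)
    # iter(etree.Entity) visits only the entity reference nodes.
    return {node.name for node in root.iter(etree.Entity)}


if __name__ == "__main__":
    print(sorted(collect_entity_names(document)))
```

Entities that are declared in the internal subset but never referenced in the document will not show up this way, since only reference nodes in the tree are visited.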