[lxml-dev] docinfo.doctype doesn't include internal entities?

Hi there,

I am looking for a way to output internal entities that have been parsed from the original document when writing out a tree, but apparently this is not exposed in any attribute. Here's an example:

{{{
import lxml.etree

document = """<?xml version="1.0"?>
<!DOCTYPE application [
<!ENTITY nbsp "\ ">
]>
<application>
</application>
"""
tree = lxml.etree.fromstring(document)
print(tree.getroottree().docinfo.doctype)
}}}

I would expect this to output:

{{{
<!DOCTYPE application [
<!ENTITY nbsp "\ ">
]>
}}}

But instead it gives me:

{{{
<!DOCTYPE application>
}}}

Is it a bug, or am I not looking in the right place?

-- Sidnei da Silva

Sidnei da Silva wrote:
What you are looking for is the internal subset of the document, which is not (really) part of the DOCTYPE itself. It's available through the "docinfo.internalDTD" property. However, lxml.etree doesn't expose the content of the DTD, so this is currently only usable for validation (i.e. not very helpful in your case).

What you could try is to parse the document without resolving the entities, then traverse the Entity elements and collect their names in a set. That will not give you the resolved entity values, though...

I think it would be nice if tostring() could serialise DTDs, but I doubt that there are many use cases for that. In your case, you'd then have to parse the DTD yourself, which you could also do by clearing the root node and serialising the document to unicode.

Stefan
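The suggestion above (parse without resolving entities, then collect the names of the Entity elements) might look roughly like the sketch below. The sample document, its entity value, and the helper name `referenced_entity_names` are assumptions made for illustration. As noted above, this only finds entities that are actually referenced in the content, and it does not recover their declared values.

```python
from lxml import etree

# Hypothetical sample: the entity value and the text that references it
# are made up for this sketch.
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE application [
<!ENTITY nbsp "&#160;">
]>
<application>a&nbsp;b</application>
"""


def referenced_entity_names(xml_bytes):
    """Return the names of entity references left unresolved in the tree."""
    # resolve_entities=False keeps each reference as an Entity node
    # instead of substituting its replacement text.
    parser = etree.XMLParser(resolve_entities=False)
    root = etree.fromstring(xml_bytes, parser)
    # Entity nodes expose the referenced name through their .name property.
    return {node.name for node in root.iter(etree.Entity)}


names = referenced_entity_names(DOCUMENT)
print(sorted(names))
```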

On Sat, 2009-04-18 at 08:46 +0200, Stefan Behnel wrote:
Hello,

I'm sorry to be resurrecting an ancient thread, but it seemed the easiest way to bring up the fact that I've recently come up with a use case for exactly the feature mentioned here, namely serializing internal DTD subsets.

I've been writing a round trip converter for a personal XML shorthand, and internal DTD subsets are the only thing I haven't been able to come up with a good workaround for being able to pull out of the original document, do my modifications, and create an identical new document from. So far the best I've been able to do is destructively modify a copy of the document to the point where I've a reasonable chance of writing my own string parser to pull out the internal DTD subset, a parser which is looking unfortunately complicated to be able to deal with multiple top level elements (comments being the most common for me).

Really, I'd be happy in the simplest case if docinfo.doctype included the internal DTD subset exactly as defined in the original document (parsed or not), as then it'd be reasonably easy to at least stick things back together at the string level. Is this the kind of thing you'd accept a wishlist bugtracker item on?

Somewhat related, I was surprised to discover that the TreeBuilder API doesn't deal well with additional top level elements. For example,

Python 2.6.2 (r262:71600, May 29 2009, 09:48:09)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
I was expecting that the TreeBuilder class was actually for creating XML documents, but I guess it's actually meant for XML fragments? I expected this:
>>> etree.tostring( e )
'<element/>'
But when I went to retrieve the root tree, I was surprised that my other top level elements were missing.
>>> etree.tostring( e.getroottree( ) )
'<element/>'
But since it also allows you to create multiple top level elements with start and end, it obviously doesn't care about the restrictions of creating an XML document. This actually mattered to me, as my XML here makes heavy use of top level comments for documentation.

Anyway, not a big deal, I figured I'd just keep track of all the top level parts myself, and create an ElementTree manually. Only, it doesn't look like there's any way to add top level elements like doctype information or comments or processing instructions without first serializing all the parts to strings, sticking them together, and running them back through the parser again. I can't find any way to duplicate what the parser does through the API, and am hoping I'm just missing some obscure corner of the ElementTree API that would let me build this programmatically. Sure, I can read them using ElementTree.getroot( ).itersiblings( ), but I couldn't find any way to create them or doctype information without resorting to string parsing.

So, yeah, really just a diary of my misunderstandings of the TreeBuilder API, and my attempts to work around it. Hopefully I'm missing something obvious.

I should also add a note here, in thanks for all the effort you've put into lxml. I've been using it daily for over 2 years now, and I can't imagine programming XML with Python using anything else. Even the original ElementTree seems limited in comparison now, much less ending up in JavaScript looking at DOM code. It's been the most useful and best supported library I depend on. Thank you.

--
John Krukoff <jkrukoff@ltgc.com>
Land Title Guarantee Company
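For the question of building top level comments and processing instructions without going through the parser, one possible route in current lxml releases is to attach them as siblings of the root element with addprevious() and addnext(), and to pass a DOCTYPE string to tostring() through its doctype keyword. This is a minimal sketch that assumes a reasonably recent lxml (whether the version discussed in this thread behaves the same way is not confirmed here); the element name, comment text, PI target and DOCTYPE string are placeholders.

```python
from lxml import etree

# Placeholder names throughout; only the API calls matter here.
root = etree.Element("element")

# On a root element, addprevious()/addnext() accept comments and
# processing instructions, which become top level siblings of the root.
root.addprevious(etree.Comment(" top level documentation "))
root.addnext(etree.ProcessingInstruction("target", "data"))

# Serialising the element alone drops the siblings ...
element_only = etree.tostring(root, encoding="unicode")

# ... while serialising the ElementTree keeps them. The doctype keyword
# prepends the given DOCTYPE declaration to the output.
tree = root.getroottree()
whole_document = etree.tostring(
    tree, encoding="unicode", doctype="<!DOCTYPE element>"
)

print(element_only)
print(whole_document)
```

The distinction between serialising the element and serialising its ElementTree is the same one seen in the session above: only the tree-level call writes out what surrounds the root.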
