cleanup namespaces and XML elements with QNames
Hi: using lxml 3.3.5, we have XML documents with elements having QName type values. We would like to use etree.cleanup_namespaces, but are finding that it leads to downstream parsers/validators complaining about undeclared namespace prefixes. Below is an isolated example:

    from lxml import etree

    nsmap = {
        'ogc': 'http://www.opengis.net/ogc',
        'ows': 'http://www.opengis.net/ows',
        'gml': 'http://www.opengis.net/gml'
    }

    root = etree.Element('{http://www.opengis.net/ogc}Filter', nsmap=nsmap)

    typename = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
    typename.text = etree.QName('http://www.opengis.net/gml', 'Envelope')

    typename2 = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
    typename2.text = etree.QName('{http://www.opengis.net/gml}Envelope')

    print etree.tostring(root, pretty_print=True)
    etree.cleanup_namespaces(root)
    print etree.tostring(root, pretty_print=True)

Here we would like the gml namespace declaration to be kept, but it looks like cleanup_namespaces throws out namespace declarations even if they apply to element content.

Are there any workarounds we can use/implement to clean up unused namespaces while preserving those needed for element content, as above?

Thanks in advance.

..Tom
Tom Kralidis wrote on 12.05.2015 at 20:03:
I'm getting this as output:

    <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:typeName>gml:Envelope</ogc:typeName>
      <ogc:typeName>gml:Envelope</ogc:typeName>
    </ogc:Filter>

So your problem is that "gml:Envelope" is actually text content and not structure, which means that lxml ignores it when removing unused namespace declarations (and the "gml" prefix is really unused, except that some downstream processor wants it to be there).

QName() is only special at the time of assignment. Afterwards, it's turned into regular text content. So there is no way lxml could figure out later that it's something it needs to care about.

So, the work-around I would suggest is to not call cleanup_namespaces() and to keep the namespace declarations as they are in the tree. If repetition is a problem, use compression.

Or is there an actual reason why additional namespace declarations hurt here?

Stefan
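The point about QName() can be checked directly. The following is only a minimal sketch, assuming Python 3 and a reasonably recent lxml; the variable names `text_after_assignment` and `serialized` are made up for illustration. It shows that the QName is resolved to a prefixed string when it is assigned, and that cleanup_namespaces() then drops the "gml" declaration because nothing structural refers to it.

```python
from lxml import etree

OGC = 'http://www.opengis.net/ogc'
GML = 'http://www.opengis.net/gml'
nsmap = {'ogc': OGC, 'gml': GML}

root = etree.Element('{%s}Filter' % OGC, nsmap=nsmap)
typename = etree.SubElement(root, '{%s}typeName' % OGC)
typename.text = etree.QName(GML, 'Envelope')

# The QName was resolved against the in-scope declarations at assignment
# time; what the element stores afterwards is ordinary text.
text_after_assignment = typename.text
print(repr(text_after_assignment))

# No element or attribute name uses the gml namespace, so its
# declaration counts as unused and is removed.
etree.cleanup_namespaces(root)
serialized = etree.tostring(root).decode('utf-8')
print(serialized)
```

After the cleanup the text still reads "gml:Envelope", but the prefix is no longer declared anywhere in the serialized document, which is what the downstream validators object to.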
On Wed, 13 May 2015, Stefan Behnel wrote:
Is there value in adding an optional argument to etree.cleanup_namespaces (like preserve_qname_text_content=False or something)? This would then require some sort of register of element text which is QName'd. I'm guessing this is too exotic, but I can file an enhancement ticket if there is interest.

They don't hurt, functionally. Our XML root elements are huge (and often bigger than the rest of a given payload). Given the way we serialize (big namespace map), we can live with this for now via compression.

Thanks

..Tom
Tom Kralidis wrote on 13.05.2015 at 16:34:
I'd rather add an option to prevent a specific sequence/set of prefixes from being removed from the tree. Together with the new option "top_nsmap" in the next version, that would allow you to move certain declarations to the root element and prevent them from being dropped.

Finding out which prefixes are used in text content is an application-specific problem and should be handled on that side. I'd be surprised if this needed more than 5 lines of Python code in your case.

> I can file an enhancement ticket if there is interest.

Alternatively, you could implement it and send a pull request. :)

Stefan
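As an illustration of the application-side step mentioned above, here is a rough sketch of collecting the prefixes that occur in element text. It assumes Python 3, and it assumes that QName-valued text always has the form "prefix:local"; the helper name `prefixes_in_text` and the sample document are hypothetical. It has to run before any cleanup, because it relies on the declarations still being in scope.

```python
from lxml import etree


def prefixes_in_text(root):
    """Collect prefixes from element text shaped like 'prefix:local',
    keeping only prefixes that are declared in scope for that element."""
    used = set()
    for el in root.iter():
        # Comments and processing instructions have a non-string tag.
        if not isinstance(el.tag, str):
            continue
        text = (el.text or '').strip()
        prefix, sep, local = text.partition(':')
        if sep and local and prefix in el.nsmap:
            used.add(prefix)
    return used


doc = etree.fromstring(
    b'<ogc:Filter xmlns:ogc="http://www.opengis.net/ogc" '
    b'xmlns:ows="http://www.opengis.net/ows" '
    b'xmlns:gml="http://www.opengis.net/gml">'
    b'<ogc:typeName>gml:Envelope</ogc:typeName>'
    b'<ogc:Literal>12:30</ogc:Literal>'
    b'</ogc:Filter>')

found = prefixes_in_text(doc)
print(sorted(found))
```

Checking the candidate prefix against the element's own nsmap keeps values such as "12:30" from being mistaken for QNames. The sketch looks only at element text; attribute values and tail text would need the same treatment if QNames can appear there.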
On Wed, 13 May 2015, Stefan Behnel wrote:
Thanks for the info. Something like the below would do the trick I'm guessing?

    etree.cleanup_namespaces(root)

    for xpath in root.xpath('//text()'):
        if ':' in xpath:
            prefix, _ = xpath.split(':')
            if prefix in nsmap:
                root.nsmap[prefix] = nsmap[prefix]

Not sure if/how expensive this would be, or if there are more efficient approaches?

Thanks

..Tom
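Two caveats with the snippet above, as far as I understand lxml's behaviour: Element.nsmap hands back a newly built dictionary on each access, so assigning into it does not add a declaration to the tree; and unpacking the result of split(':') into two names raises ValueError when the text contains more than one colon. A small check under those assumptions (Python 3; the variable names and the sample URL are made up for illustration):

```python
from lxml import etree

root = etree.Element('{http://www.opengis.net/ogc}Filter',
                     nsmap={'ogc': 'http://www.opengis.net/ogc'})

# This writes to the temporary dict returned by .nsmap, not to the tree.
root.nsmap['gml'] = 'http://www.opengis.net/gml'
still_missing = 'gml' not in root.nsmap
serialized = etree.tostring(root).decode('utf-8')
print(still_missing, serialized)

# Two-name unpacking fails when the text has more than one colon.
try:
    prefix, _ = 'http://example.org/a:b'.split(':')
    unpack_failed = False
except ValueError:
    unpack_failed = True
print(unpack_failed)
```

So the declarations would have to be protected during the cleanup itself rather than re-added afterwards through nsmap.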
On Wed, May 13, 2015 at 3:42 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I've issued a PR here: https://github.com/lxml/lxml/pull/167 for consideration.

Thanks

..Tom
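For readers arriving later: to my knowledge, lxml releases from 3.5 onward accept a keep_ns_prefixes argument (alongside top_nsmap) on cleanup_namespaces(), which covers the case discussed in this thread. A minimal sketch, assuming such a version and Python 3; which prefixes to keep is still the application's decision, and 'gml' is hard-coded here purely for illustration.

```python
from lxml import etree

nsmap = {
    'ogc': 'http://www.opengis.net/ogc',
    'ows': 'http://www.opengis.net/ows',
    'gml': 'http://www.opengis.net/gml',
}

root = etree.Element('{http://www.opengis.net/ogc}Filter', nsmap=nsmap)
typename = etree.SubElement(root, '{http://www.opengis.net/ogc}typeName')
typename.text = etree.QName('http://www.opengis.net/gml', 'Envelope')

# Drop unused declarations, but keep 'gml' because element text refers to it.
etree.cleanup_namespaces(root, keep_ns_prefixes=['gml'])

serialized = etree.tostring(root, pretty_print=True).decode('utf-8')
print(serialized)
```

With this call the unused "ows" declaration is removed while "gml" stays declared on the root element, so "gml:Envelope" in the text remains resolvable for downstream processors.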
participants (2)
- Stefan Behnel
- Tom Kralidis