Re: python lxml.objectify gives no attribute access to gco:CharacterString node
Hi Volker,
This is expected and just how lxml.objectify works. Since {http://www.isotc211.org/2005/gco}CharacterString is from a different namespace than its parent element ({http://www.isotc211.org/2005/gmd}fileIdentifier), it cannot be retrieved from the parent with "simple (dot) attribute lookup". Instead you have to use index access (just like you already did) or getattr.
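A minimal sketch of both access styles, assuming a stripped-down stand-in for the metadata document (the XML below is illustrative, not the original file):

```python
from lxml import objectify

# Illustrative stand-in for the ISO metadata document discussed in this thread.
XML = b"""\
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:fileIdentifier>
    <gco:CharacterString>4157d397-e2c3-4e6e-8a84-0712aa9c1162</gco:CharacterString>
  </gmd:fileIdentifier>
</gmd:MD_Metadata>
"""

root = objectify.fromstring(XML)

# Same namespace as the parent (gmd): plain dot access works.
file_id = root.fileIdentifier

# Different namespace (gco): pass the fully qualified tag name.
GCO_STRING = "{http://www.isotc211.org/2005/gco}CharacterString"
via_index = file_id[GCO_STRING]
via_getattr = getattr(file_id, GCO_STRING)

# Unqualified dot access searches the parent's namespace only and fails here.
try:
    file_id.CharacterString
    dot_lookup_failed = False
except AttributeError:
    dot_lookup_failed = True
```

Both qualified forms return the same child element, while the unqualified lookup raises AttributeError.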
The reason for this is that obviously {http://www.isotc211.org/2005/gco}CharacterString is not a valid Python identifier and it makes sense to restrict unqualified lookup to children from the same namespace.
See also "Namespace handling" in the official docs: https://lxml.de/objectify.html#the-lxml-objectify-api
Best,
Holger
Landesbank Baden-Wuerttemberg
Institution under public law
Head offices: Stuttgart, Karlsruhe, Mannheim, Mainz
HRA 12704
Stuttgart District Court
HRA 4356, HRA 104 440
Mannheim District Court
HRA 40687
Mainz District Court
LBBW processes your personal data in accordance with the requirements of the GDPR.
Information is available at https://www.lbbw.de/datenschutz.
Hi Holger!
Thank you very much for the fast response.
On 28.02.22 at 08:41, Holger.Joukl@LBBW.de wrote:
> The reason for this is that obviously {http://www.isotc211.org/2005/gco}CharacterString is not a valid Python identifier and it makes sense to restrict unqualified lookup to children from the same namespace.
I'd like to disagree with
> and it makes sense to restrict unqualified lookup to children from the same namespace
What does the namespace of a node have in common with the namespace of one of its subnodes? Nothing. It is quite common in XML to borrow from other namespaces.
Other namespace-based Python libraries, for instance RDFlib, solve this problem generically by adding the namespace to the Python property:

    {http://www.isotc211.org/2005/gco}CharacterString -> gco_CharacterString

This works like a charm. Not once have I hit a corner case.
The problem is buried deep in the nature of the lxml objectify implementation. Objectify does not really transform the XML into a real Python instance hierarchy (as RDFlib does), but directs all attribute access via function calls to the C libxml core. On the one hand this is desired behavior, since one can change the XML on the fly and some of the changes are visible both in the XML and in the objectified representation. On the other hand, the information about which namespace a node belongs to is not persistent in the node and therefore cannot be used for lookup. This can easily be seen in lxml/objectify.pyx, line 414ff:

    cdef tree.xmlNode* _findFollowingSibling(tree.xmlNode* c_node,
                                             const_xmlChar* href, const_xmlChar* name,
                                             Py_ssize_t index):
        cdef tree.xmlNode* (*next)(tree.xmlNode*)
        if index >= 0:
            next = cetree.nextElement
        else:
            index = -1 - index
            next = cetree.previousElement
        while c_node is not NULL:
            if c_node.type == tree.XML_ELEMENT_NODE and \
                   _tagMatches(c_node, href, name):
                index = index - 1
                if index < 0:
                    return c_node
            c_node = next(c_node)
        return NULL

To find the desired sibling, the code loops over all children and matches (parentNamespace, propertyName) against them. The correct operation of _findFollowingSibling should IMHO be: make a lookup on all children (with the Python property name only). If one match is found, return this match. If none or more than one match is found, no answer is possible.
I extended _findFollowingSibling to

    cdef tree.xmlNode* _findFollowingSibling(tree.xmlNode* c_node,
                                             const_xmlChar* href, const_xmlChar* name,
                                             Py_ssize_t index):
        cdef tree.xmlNode* (*next)(tree.xmlNode*)
        cdef tree.xmlNode* start_node
        cdef tree.xmlNode* result_node
        cdef int found = 0

        start_node = c_node

        if index >= 0:
            next = cetree.nextElement
        else:
            index = -1 - index
            next = cetree.previousElement
        # search with namespace
        while c_node is not NULL:
            if c_node.type == tree.XML_ELEMENT_NODE and \
                   _tagMatches(c_node, href, name):
                index = index - 1
                if index < 0:
                    return c_node
            c_node = next(c_node)
        # search without namespace
        c_node = start_node
        while c_node is not NULL:
            if c_node.type == tree.XML_ELEMENT_NODE and c_node.name == name:
                index = index - 1
                if index < 0:
                    result_node = c_node
                    found += 1
            c_node = next(c_node)
        # check if only one result is found
        if found == 1:
            return result_node
        return NULL

Sorry for my clumsy Cython. But it works perfectly well. I also preserved the notion to look up in the parent namespace first.

    >>> node.fileIdentifier.CharacterString
    '4157d397-e2c3-4e6e-8a84-0712aa9c1162'

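The lookup rule described above can be sketched in plain Python. This is only an illustration using the standard library's xml.etree.ElementTree, not the Cython code from the branch: `split_tag` and `lookup_child` are hypothetical helpers, the index handling of `_findFollowingSibling` is left out, and the sketch returns None where objectify would raise AttributeError.

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split '{uri}local' into (uri, local); uri is '' for unqualified tags."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def lookup_child(parent, name):
    """Hypothetical rule: parent namespace first, then a unique local-name match."""
    parent_ns, _ = split_tag(parent.tag)
    children = [(split_tag(child.tag), child) for child in parent]
    # 1. Existing rule: the child must live in the parent's namespace.
    for (ns, local), child in children:
        if ns == parent_ns and local == name:
            return child
    # 2. Proposed fallback: ignore namespaces, accept only an unambiguous match.
    matches = [child for (ns, local), child in children if local == name]
    return matches[0] if len(matches) == 1 else None


doc = ET.fromstring(
    '<gmd:fileIdentifier xmlns:gmd="http://www.isotc211.org/2005/gmd"'
    ' xmlns:gco="http://www.isotc211.org/2005/gco">'
    '<gco:CharacterString>4157d397-e2c3-4e6e-8a84-0712aa9c1162</gco:CharacterString>'
    '</gmd:fileIdentifier>'
)

found = lookup_child(doc, "CharacterString")
missing = lookup_child(doc, "doesNotExist")
```

Here `found` is the gco:CharacterString child even though its namespace differs from the parent's, and `missing` is None.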
I would really like it if someone could test this proof of concept: https://github.com/Inqbus/lxml, branch better-objectify-attributes <https://github.com/Inqbus/lxml/tree/better-objectify-attributes>. If the feedback is positive, I will come up with a pull request.
Cheers,
Volker
--
=========================================================
   inqbus Scientific Computing    Dr. Volker Jaenisch
   Hungerbichlweg 3               +49 (8860) 9222 7 92
   86977 Burggen                  https://inqbus.de
=========================================================
On 1 Mar 2022, at 16:06, Dr. Volker Jaenisch wrote:
> Other namespace based python libs like for instance RDFlib solve this problem generically by adding the namespace to the python property.
Given how central namespaces are to XML and how often conflicts can occur with abbreviations and prefixes, I don't think what you suggest should be the standard behaviour. It might be fine for a small scope like RDF but I can think of several places in OOXML where it could cause problems.
Still, your suggestion for a namespace-free lookup looks like it could be very useful.
Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
Dear Charlie!
On 01.03.22 at 16:16, Charlie Clark wrote:
I agree completely. But it should be an option that can be chosen via a configuration setting or a parameter.
Would you please be so kind as to test such a scenario? I think that my change is really safe.

* It preserves the former behavior: looking up the parent namespace first. (This is generally not correct, see below.)
* If one match is found, the match is returned.
* If more than one match is found, no action is taken. This deals with conflicting namespaces.
* If no match is found, no action is taken.

So in case we do indeed have conflicting namespaces

    foo:{http://foo}/test
    bar:{http://bar}/test

    <foo:parent>
        <foo:test>
    </foo:parent>

Lookup of "test" will return <foo:test> since the parent namespace is foo.

    <bar:parent>
        <foo:test>
    </bar:parent>

Lookup of "test" will return <bar:test> since the parent namespace is bar. *OK, this is not correct.* In this case nothing should be returned (as it formerly was in lxml). *Already fixed in GH.*

    <parent>
        <foo:test>
    </parent>

Will return nothing since there are two answers.

I think the logic should be: *If lxml finds one matching child by property name, then this is the correct answer. If no child or more than one child matches, no action is taken.*
Matching against the parent namespace is IMHO in no case correct. Matching against a default namespace would be a better option.
> Still, your suggestion for a namespace-free lookup looks like it could be very useful.
You are welcome.
Cheers,
Volker
Hi,
> I think the logic should be: If lxml finds one matching child by property name, then this is the correct answer. If no child or more than one child matches, no action is taken.
I take it you mean the namespace-unqualified property name here? This is not desirable behavior in my book.
> Matching against the parent namespace is IMHO in no case correct. Matching against a default namespace would be a better option.
Again, I have a different opinion here: Matching against the parent namespace for child lookup is exactly the right thing to do (TM). :-)
> Still, your suggestion for a namespace-free lookup looks like it could be very useful.
Well, you can always do s.th. like
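The snippet from the original message is not preserved here. A minimal sketch of the kind of namespace-agnostic lookup meant, assuming a small illustrative document; `children_by_local_name` is a hypothetical helper, not part of lxml:

```python
from lxml import etree, objectify

root = objectify.fromstring(
    '<gmd:fileIdentifier xmlns:gmd="http://www.isotc211.org/2005/gmd"'
    ' xmlns:gco="http://www.isotc211.org/2005/gco">'
    '<gco:CharacterString>4157d397-e2c3-4e6e-8a84-0712aa9c1162</gco:CharacterString>'
    '</gmd:fileIdentifier>'
)


def children_by_local_name(parent, local_name):
    """Return all child elements whose local name matches, in any namespace."""
    return [
        child for child in parent.iterchildren()
        if etree.QName(child).localname == local_name
    ]


by_iteration = children_by_local_name(root, "CharacterString")

# The same lookup as a precompiled XPath expression using local-name().
find_character_string = etree.XPath("*[local-name() = 'CharacterString']")
by_xpath = find_character_string(root)
```

Both variants return a list, so the caller decides explicitly what to do with zero or several matches.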
(works with other iterators too, or you could use XPaths (maybe even compiled) with local-name())
I must say I'm pretty fundamentally opposed to the suggestion, FWIW. So -1 from me.
Best,
Holger
Dr. Volker Jaenisch wrote on 01.03.22 at 16:06:
I see a major drawback with this behaviour, and that is non-local dependencies. If you have this XML:

    <a:root xmlns:a="A" xmlns:b="B">
      <b:ch1/>
      <b:ch2/>
    </a:root>

then "root.ch1" would give you the first child. Great, so you use that in your code.
Now, someone decides to send you an input document that looks like this:

    <a:root xmlns:a="A" xmlns:b="B" xmlns:c="C">
      <b:ch1/>
      <b:ch2/>
      <c:ch1/>
    </a:root>

And your code will suddenly fail to find "root.ch1". Depending on what your code does and how it does it, it may fail with an exception, or it may fail silently to find the desired data and just keep working without it.
Note that the content of the XML file that your code is designed to process did not change at all. It's just that some entirely unrelated content was added, in a completely different and unrelated namespace. And it was just externally added to the input data, or maybe just to some tiny portion of it, without telling you or your code about it. Especially in places with optional content, where different namespaces are already a little more common than elsewhere, this is fairly likely to go unnoticed.
I find this kind of behaviour dangerous enough to restrict the "magic" in the API to what is easy to understand and predict.
Stefan
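The non-local dependency can be reproduced with a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, and `unique_local_match` is a hypothetical stand-in for the proposed namespace-free lookup, not code from the branch:

```python
import xml.etree.ElementTree as ET


def unique_local_match(parent, name):
    """Hypothetical namespace-free lookup: succeed only on exactly one match."""
    matches = [c for c in parent if c.tag.rsplit("}", 1)[-1] == name]
    return matches[0] if len(matches) == 1 else None


before = ET.fromstring(
    '<a:root xmlns:a="A" xmlns:b="B"><b:ch1/><b:ch2/></a:root>'
)
after = ET.fromstring(
    '<a:root xmlns:a="A" xmlns:b="B" xmlns:c="C">'
    '<b:ch1/><b:ch2/><c:ch1/></a:root>'
)

# The namespace-free lookup finds <b:ch1/> in the first document ...
hit_before = unique_local_match(before, "ch1")
# ... but becomes ambiguous once the unrelated <c:ch1/> is present.
hit_after = unique_local_match(after, "ch1")

# A namespace-qualified lookup is unaffected by the added element.
qualified_after = after.find("{B}ch1")
```

The qualified lookup keeps working on the second document, while the namespace-free one silently stops returning a result.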
participants (4)
- Charlie Clark
- Dr. Volker Jaenisch
- Holger.Joukl@LBBW.de
- Stefan Behnel