Re: python lxml.objectify gives no attribute access to gco:CharacterString node

Hi Volker,
> A week ago I posted a fairly simple lxml.objectify question on Stack Overflow and have not received an answer yet: https://stackoverflow.com/questions/71212909/python-lxml-objectify-gives-no-... Since there is no easy answer, I conclude that my feeling that something is wrong in lxml.objectify is not just my imagination, so I am bringing the topic to the mailing list. Any help appreciated. Volker Jaenisch
This is expected and just how lxml.objectify works. Since {http://www.isotc211.org/2005/gco}CharacterString is from a different namespace than its parent element ({http://www.isotc211.org/2005/gmd}fileIdentifier), it cannot be retrieved from the parent with "simple (dot) attribute lookup".
Instead you have to use index access (just like you already did) or getattr:
```
>>> from lxml import etree, objectify
>>> data = objectify.fromstring(
... """<gmd:MD_Metadata
... xmlns:gmd="http://www.isotc211.org/2005/gmd"
... xmlns:gco="http://www.isotc211.org/2005/gco" >
... <gmd:fileIdentifier>
... <gco:CharacterString>2ce585df-df23-45f6-b8e1-184e64e7e3b5</gco:CharacterString>
... </gmd:fileIdentifier>
... <gmd:language>
... <gmd:LanguageCode codeList="https://www.loc.gov/standards/iso639-2/" codeListValue="ger">ger</gmd:LanguageCode>
... </gmd:language>
... </gmd:MD_Metadata>
... """)
>>> data.fileIdentifier["{http://www.isotc211.org/2005/gco}CharacterString"]
'2ce585df-df23-45f6-b8e1-184e64e7e3b5'
>>> getattr(data.fileIdentifier, "{http://www.isotc211.org/2005/gco}CharacterString")
'2ce585df-df23-45f6-b8e1-184e64e7e3b5'
```
The reason for this is that {http://www.isotc211.org/2005/gco}CharacterString is obviously not a valid Python identifier, and it makes sense to restrict unqualified lookup to children from the same namespace.
See also "Namespace handling" in the official docs: https://lxml.de/objectify.html#the-lxml-objectify-api
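A minimal sketch of the behaviour described above, using a trimmed-down version of the same document (the variable names are only for illustration). Plain attribute access works for a child in the parent's namespace, fails for a child from another namespace, and works again once the name is fully qualified:

```python
from lxml import objectify

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

# Trimmed-down version of the document from the thread.
xml = (
    '<gmd:MD_Metadata xmlns:gmd="%s" xmlns:gco="%s">'
    "<gmd:fileIdentifier>"
    "<gco:CharacterString>2ce585df-df23-45f6-b8e1-184e64e7e3b5</gco:CharacterString>"
    "</gmd:fileIdentifier>"
    "<gmd:language>"
    '<gmd:LanguageCode codeListValue="ger">ger</gmd:LanguageCode>'
    "</gmd:language>"
    "</gmd:MD_Metadata>"
) % (GMD, GCO)
data = objectify.fromstring(xml)

# Child in the same namespace as its parent: dot access works.
language_code = data.language.LanguageCode.text

# Child in a different namespace: dot access searches the parent's
# namespace (gmd) and therefore raises AttributeError.
try:
    data.fileIdentifier.CharacterString
    dot_lookup_failed = False
except AttributeError:
    dot_lookup_failed = True

# Fully qualified name via getattr: works.
identifier = getattr(data.fileIdentifier, "{%s}CharacterString" % GCO).text
```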
Best, Holger
Landesbank Baden-Wuerttemberg Anstalt des oeffentlichen Rechts Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz HRA 12704 Amtsgericht Stuttgart HRA 4356, HRA 104 440 Amtsgericht Mannheim HRA 40687 Amtsgericht Mainz
LBBW processes your personal data in accordance with the requirements of the GDPR. Information is available at https://www.lbbw.de/datenschutz.

Hi Holger!
Thank you very much for the fast response.
On 28.02.22 at 08:41, Holger.Joukl@LBBW.de wrote:
> The reason for this is that {http://www.isotc211.org/2005/gco}CharacterString is obviously not a valid Python identifier, and it makes sense to restrict unqualified lookup to children from the same namespace.
I would like to disagree with
> and it makes sense to restrict unqualified lookup to children from the same namespace
What does the namespace of a node have in common with the namespace of one of its subnodes? Nothing. It is quite common in XML to borrow from other namespaces.
Other namespace-based Python libraries, RDFlib for instance, solve this problem generically by adding the namespace to the Python property.
{http://www.isotc211.org/2005/gco}CharacterString -> gco_CharacterString. This works like a charm; I have not once hit a corner case.
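To illustrate the prefix-based naming idea on top of the existing lxml API, here is a rough sketch. The helper `prefixed_child` is hypothetical and not part of lxml or RDFlib; it simply resolves the part before the first underscore through the element's `nsmap` and then uses the qualified `getattr` lookup that objectify already supports.

```python
from lxml import objectify


def prefixed_child(parent, prefixed_name):
    """Hypothetical helper: look up 'prefix_LocalName' using prefixes in scope.

    Assumes the prefix itself contains no underscore and relies on the
    prefix chosen by the document, which may differ between documents.
    """
    prefix, sep, local_name = prefixed_name.partition("_")
    uri = parent.nsmap.get(prefix)
    if not sep or uri is None:
        raise AttributeError("unknown namespace prefix in %r" % prefixed_name)
    # Qualified lookup as supported by lxml.objectify.
    return getattr(parent, "{%s}%s" % (uri, local_name))


root = objectify.fromstring(
    '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"'
    ' xmlns:gco="http://www.isotc211.org/2005/gco">'
    "<gmd:fileIdentifier>"
    "<gco:CharacterString>2ce585df-df23-45f6-b8e1-184e64e7e3b5</gco:CharacterString>"
    "</gmd:fileIdentifier>"
    "</gmd:MD_Metadata>"
)

value = prefixed_child(root.fileIdentifier, "gco_CharacterString").text

# An unknown prefix is reported as a failed lookup.
try:
    prefixed_child(root.fileIdentifier, "xyz_CharacterString")
    unknown_prefix_failed = False
except AttributeError:
    unknown_prefix_failed = True
```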
The problem is buried deep in the nature of the lxml.objectify implementation. Objectify does not really transform the XML into a real Python instance hierarchy (as RDFlib does), but routes all attribute access via function calls to the C libxml core. On the one hand this is desirable, since you can change the XML on the fly and some of the changes are visible both in the XML and in the objectified representation. On the other hand, the information about which namespace a node belongs to is not persistent in the node and therefore cannot be used for lookup.
This can easily be seen in lxml/objectify.pyx, line 414ff:
cdef tree.xmlNode* _findFollowingSibling(tree.xmlNode* c_node,
                                         const_xmlChar* href, const_xmlChar* name,
                                         Py_ssize_t index):
    cdef tree.xmlNode* (*next)(tree.xmlNode*)
    if index >= 0:
        next = cetree.nextElement
    else:
        index = -1 - index
        next = cetree.previousElement
    while c_node is not NULL:
        if c_node.type == tree.XML_ELEMENT_NODE and \
               _tagMatches(c_node, href, name):
            index = index - 1
            if index < 0:
                return c_node
        c_node = next(c_node)
    return NULL

To find the desired sibling, the code loops over all children and matches (parentNamespace, propertyName) against them.
The correct operation of _findFollowingSibling should IMHO be:
Make a lookup on all children (with the python property name only). If one match is found then return this match. If none or more than one match is found then no answer is possible.
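The proposed rule can be sketched in pure Python on top of the public API, without touching the Cython code. The helper name `unique_child` and the example.com namespace URIs are made up for illustration; `iterchildren` with a `{*}` wildcard is existing lxml functionality.

```python
from lxml import objectify


def unique_child(parent, local_name):
    """Hypothetical helper: return the only child with this local name.

    Returns None if no child or more than one child matches,
    regardless of the namespaces involved.
    """
    matches = list(parent.iterchildren("{*}" + local_name))
    return matches[0] if len(matches) == 1 else None


single = objectify.fromstring(
    '<a:root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">'
    "<b:ch1>one</b:ch1><b:ch2>two</b:ch2>"
    "</a:root>"
)
conflict = objectify.fromstring(
    '<parent xmlns:foo="http://example.com/foo" xmlns:bar="http://example.com/bar">'
    "<foo:test>f</foo:test><bar:test>b</bar:test>"
    "</parent>"
)

found = unique_child(single, "ch1")          # exactly one match
ambiguous = unique_child(conflict, "test")   # two matches: no answer
missing = unique_child(single, "nothere")    # no match: no answer
```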
I extended _findFollowingSibling to
cdef tree.xmlNode* _findFollowingSibling(tree.xmlNode* c_node,
                                         const_xmlChar* href, const_xmlChar* name,
                                         Py_ssize_t index):
    cdef tree.xmlNode* (*next)(tree.xmlNode*)
    cdef tree.xmlNode* start_node
    cdef tree.xmlNode* result_node
    cdef int found = 0

    start_node = c_node
    if index >= 0:
        next = cetree.nextElement
    else:
        index = -1 - index
        next = cetree.previousElement
    # search with namespace
    while c_node is not NULL:
        if c_node.type == tree.XML_ELEMENT_NODE and \
               _tagMatches(c_node, href, name):
            index = index - 1
            if index < 0:
                return c_node
        c_node = next(c_node)
    # search without namespace
    c_node = start_node
    while c_node is not NULL:
        if c_node.type == tree.XML_ELEMENT_NODE and c_node.name == name:
            index = index - 1
            if index < 0:
                result_node = c_node
                found += 1
        c_node = next(c_node)
    # check if only one result is found
    if found == 1:
        return result_node
    return NULL
Sorry for my clumsy Cython. But it works perfectly well. I also preserved the notion to look up in the parent namespace first.
node.fileIdentifier.CharacterString
'4157d397-e2c3-4e6e-8a84-0712aa9c1162'
I would really like someone to test this proof of concept: https://github.com/Inqbus/lxml, branch better-objectify-attributes (https://github.com/Inqbus/lxml/tree/better-objectify-attributes). If I get positive feedback, I will come up with a pull request.
Cheers, Volker

On 1 Mar 2022, at 16:06, Dr. Volker Jaenisch wrote:
> Other namespace based python libs like for instance RDFlib solve this problem generically by adding the namespace to the python property.
Given how central namespaces are to XML and how often conflicts can occur with abbreviations and prefixes, I don't think what you suggest should be the standard behaviour. It might be fine for a small scope like RDF but I can think of several places in OOXML where it could cause problems.
Still, your suggestion for a namespace-free lookup looks like it could be very useful.
Charlie
-- Charlie Clark Managing Director Clark Consulting & Research German Office Sengelsweg 34 Düsseldorf D- 40489 Tel: +49-203-3925-0390 Mobile: +49-178-782-6226

Dear Charlie!
On 01.03.22 at 16:16, Charlie Clark wrote:
> On 1 Mar 2022, at 16:06, Dr. Volker Jaenisch wrote:
>> Other namespace based python libs like for instance RDFlib solve this problem generically by adding the namespace to the python property.
> Given how central namespaces are to XML and how often conflicts can occur with abbreviations and prefixes, I don't think what you suggest should be the standard behaviour.
I agree completely. But it should be an option that can be chosen via a configuration setting or a parameter.
> It might be fine for a small scope like RDF
I would not say that RDF is small :-) In fact it uses all the XML namespaces.
> but I can think of several places in OOXML where it could cause problems.
Would you be so kind as to test such a scenario?
I think that my change is really safe.
* It preserves the former behavior: looking up the parent namespace first. (This is generally not correct, see below.)
* If exactly one match is found, the match is returned.
* If more than one match is found, no action is taken. This deals with conflicting namespaces.
* If no match is found, no action is taken.
So in case we have indeed a conflicting namespace
foo:{http://foo}/test
bar:{http://bar}/test
</foo:parent>
Lookup of "test" will return foo:test since parent namespace is foo.
</foo:parent>
Lookup of "test" will return bar:test since parent namespace is bar. *Ok this is not correct.* In this case nothing should be returned (as it formerly was in lxml). *Already fixed in GH.*
<parent>
<parent>
Will return nothing since two answers.
I think the logic should be:
*If lxml finds exactly one matching child by property name, then this is the correct answer. If no child or more than one child matches, no action is taken.*
Matching against the parent namespace is IMHO in no case correct. Matching against a default namespace would be a better option.
> Still, your suggestion for a namespace-free lookup looks like it could be very useful.
You are welcome.
Cheers,
Volker

Hi,
> I think the logic should be: If lxml finds exactly one matching child by property name, then this is the correct answer. If no child or more than one child matches, no action is taken.
I take it you mean namespace-unqualified property name here? This is not desirable behavior in my book.
> Matching against the parent namespace is IMHO in no case correct. Matching against a default namespace would be a better option.
Again, I have a different opinion here: Matching against the parent namespace for child lookup is exactly the right thing to do (TM). :-)
> Still, your suggestion for a namespace-free lookup looks like it could be very useful.
Well, you can always do something like
>>> from lxml import etree, objectify
>>> root = objectify.fromstring('<root xmlns:otherns="/other/namespace"><otherns:x>1</otherns:x><x>2</x></root>')
>>> list(root.iterchildren('{*}x'))
[1, 2]
(works with other iterators too, or you could use XPaths (maybe even compiled) with local-name())
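The XPath variant mentioned above could look roughly like this. It is a sketch only: the absolute namespace URI is a stand-in chosen for the example, and `by_local_name` is just an illustrative variable name. `etree.XPath` compiles the expression once, and `$name` is passed as an XPath variable on each call.

```python
from lxml import etree, objectify

root = objectify.fromstring(
    '<root xmlns:otherns="http://example.com/other">'
    "<otherns:x>1</otherns:x><x>2</x>"
    "</root>"
)

# Compiled XPath that selects child elements by local name only,
# ignoring whatever namespace they are in.
by_local_name = etree.XPath("*[local-name() = $name]")

# .pyval gives the Python value that objectify inferred for each element.
values = [child.pyval for child in by_local_name(root, name="x")]
```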
I must say I'm pretty fundamentally opposed to the suggestion, FWIW.
So -1 from me.
Best, Holger

Dr. Volker Jaenisch wrote on 01.03.22 at 16:06:
> To find the desired sibling, the code loops over all children and matches (parentNamespace, propertyName) against them.
> The correct operation of _findFollowingSibling should IMHO be:
> Make a lookup on all children (with the python property name only). If one match is found then return this match. If none or more than one match is found then no answer is possible.
I see a major drawback with this behaviour, and that is non-local dependencies. If you have this XML:
<a:root xmlns:a="A" xmlns:b="B">
  <b:ch1/>
  <b:ch2/>
</a:root>
then "root.ch1" would give you the first child. Great, so you use that in your code. Now, someone decides to send you an input document that looks like this:
<a:root xmlns:a="A" xmlns:b="B" xmlns:c="C">
  <b:ch1/>
  <b:ch2/>
  <c:ch1/>
</a:root>
And your code will suddenly fail to find "root.ch1". Depending on what your code does and how it does it, it may fail with an exception, or it may fail silently to find the desired data and just keep working without it.
Note that the content of the XML file that your code is designed to process did not change at all. It's just that some entirely unrelated content was added, in a completely different and unrelated namespace. And it was just externally added to the input data, or maybe just to some tiny portion of it, without telling you or your code about it. Especially in places with optional content, where different namespaces are already a little more common than elsewhere, this is fairly likely to go unnoticed.
I find this kind of behaviour dangerous enough to restrict the "magic" in the API to what is easy to understand and predict.
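The non-local dependency can be made concrete with a small sketch. The namespace URIs and element text are invented for the example, and counting matches by local name stands in for the proposed lookup rule. A namespace-qualified lookup is unaffected by the added element, while a lookup by local name alone goes from one match to two:

```python
from lxml import objectify

A = "http://example.com/a"
B = "http://example.com/b"
C = "http://example.com/c"

before = objectify.fromstring(
    '<a:root xmlns:a="%s" xmlns:b="%s">'
    "<b:ch1>data</b:ch1><b:ch2/>"
    "</a:root>" % (A, B)
)
# Same document, plus an unrelated element from a third namespace.
after = objectify.fromstring(
    '<a:root xmlns:a="%s" xmlns:b="%s" xmlns:c="%s">'
    "<b:ch1>data</b:ch1><b:ch2/><c:ch1>other</c:ch1>"
    "</a:root>" % (A, B, C)
)

# Lookup by local name only: unique before, ambiguous after.
count_before = len(list(before.iterchildren("{*}ch1")))
count_after = len(list(after.iterchildren("{*}ch1")))

# Namespace-qualified lookup: same result for both documents.
qualified_before = getattr(before, "{%s}ch1" % B).text
qualified_after = getattr(after, "{%s}ch1" % B).text
```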
Stefan
participants (4): Charlie Clark, Dr. Volker Jaenisch, Holger.Joukl@LBBW.de, Stefan Behnel