Re: python lxml.objectify gives no attribute access to gco:CharacterString node

Hi Volker,
The reason for this is that obviously {http://www.isotc211.org/2005/gco}CharacterString is not a valid Python identifier and it makes sense to restrict unqualified lookup to children from the same namespace.
I'd like to disagree on
and it makes sense to restrict unqualified lookup to children from the same namespace
I beg to differ :-).
What does the namespace of a node have in common with the namespace of one of its subnodes? Nothing. It is quite common in XML that you borrow from other namespaces.
I'd rather assume the vast majority of XML documents do not consist of many different namespaces and heavily "oscillate" between parents and children from different namespaces. But I've no data to back up that claim. For me, simply using parent['{/other/ns}child'] or getattr(parent, '{/other/ns}child') syntax just works. Not as beautiful as parent.child but not ugly, either.
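To make the two spellings concrete, here is a minimal sketch. The XML fragment is made up for illustration (it only imitates the gmd/gco nesting of ISO 19139 files), and the variable names are mine:

```python
from lxml import objectify

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

# Made-up fragment in the style of ISO 19139: a gmd parent with a gco child.
xml = (
    '<gmd:title xmlns:gmd="%s" xmlns:gco="%s">'
    '<gco:CharacterString>Hello</gco:CharacterString>'
    '</gmd:title>' % (GMD, GCO)
)
parent = objectify.fromstring(xml)

# Unqualified lookup only searches the parent's own namespace (gmd),
# so the gco child is not found this way.
found_unqualified = hasattr(parent, "CharacterString")

# Qualified lookup with the namespace URI in braces reaches the child.
qualified = "{%s}CharacterString" % GCO
by_getattr = getattr(parent, qualified)
by_item = parent[qualified]

print(found_unqualified, by_getattr.text, by_item.text)
```

Both qualified forms return the same child element; which one to prefer is a matter of taste.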
Other namespace based python libs like for instance RDFlib solve this problem generically by adding the namespace to the python property. {http://www.isotc211.org/2005/gco}CharacterString -> gco_CharacterString
Well, but they're not adding the namespace but the ns-prefix. lxml uses Clark notation and qualified tag names everywhere, which is less error-prone and preferable in my experience (http://www.jclark.com/xml/xmlns.htm). No problems with namespace <--> prefix indirections, multiple prefixes for the same namespace, etc. Note that to the best of my knowledge not every allowed prefix is a valid Python identifier (I think all XML name chars except ":" are allowed). So apart from tedious namespace <--> prefix indirection handling you'd also need to cater for rules to replace characters (e.g. <my-nsprefix:root xmlns:my-nsprefix="/my/ns"/> --> my_ns_root?). If you wanted to look up children using a <ns-prefix>_<unqualified name> attribute name syntax, that is.
The problem lies deeply buried in the nature of the LXML objectify implementation. Objectify does not really transform the XML into a real Python instance hierarchy (as RDFlib does), but directs all attribute access via function calls to the C-libxml core. This is on the one hand desired behavior, since one can change the XML on the fly and some of the changes are visible both in the XML and in the objectified representation. But on the other hand the information about which namespace a node belongs to is not persistent in the node and therefore cannot be used for lookup.
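A small sketch of both points, as an illustration only. It uses the standard library's xml.etree.ElementTree rather than lxml, simply to stay dependency-free (ElementTree also reports tags in Clark notation); the two one-element documents are made up:

```python
import xml.etree.ElementTree as ET

# Two made-up documents that bind different prefixes to the same namespace.
doc_a = '<a:root xmlns:a="/my/ns"/>'
doc_b = '<my-nsprefix:root xmlns:my-nsprefix="/my/ns"/>'

# In Clark notation the prefix disappears: both roots carry the same tag.
tag_a = ET.fromstring(doc_a).tag
tag_b = ET.fromstring(doc_b).tag

# A prefix-based attribute name would differ per document, and the second
# prefix is not even a valid Python identifier because of the hyphen.
prefix_a_ok = "a".isidentifier()
prefix_b_ok = "my-nsprefix".isidentifier()

print(tag_a, tag_b, prefix_a_ok, prefix_b_ok)
```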
To me this is a feature rather than a problem and it's been a design decision from day one (https://mail.python.org/archives/list/lxml@python.org/message/GOTPAWC4MHI5LV...). Some others seem to agree: https://zato.io/blog/posts/saving-time-with-a-pythonic-api-of-lxmlobjectify..... But I'm obviously and deeply biased ;-)
Best regards,
Holger

Hi Holger!
To motivate the problem: we are dealing with lots of XML files from different origins. Debugging becomes a great hassle if you are not able, e.g. in your PyCharm IDE, to navigate the XML tree your parser is currently processing. Even worse if some nodes do not even seem to exist. Do they really not exist, or have they just been omitted by lxml? It's really a pain, since in the data we process (in the international standard format ISO19139) lxml shows only a really small portion of the data: nearly all the data-holding nodes have a namespace which differs from their parent's namespace. Here is a sample file: https://backend.datenadler.de/31591bca-bb40-4d8a-98ad-35efc37524c9.xml
Yes, there may be problems with conflicting abbreviations, and for these cases the Clark notation is the only solution. But also for these cases I am sure a Python attribute notation can be found. But omitting data in a not quite logical fashion is IMHO a no-go for a data representation.
On 01.03.22 at 17:17, Holger.Joukl@LBBW.de wrote:
What does the namespace of a node have in common with the namespace of one of its subnodes? Nothing. It is quite common in XML that you borrow from other namespaces.
I'd rather assume the vast majority of XML documents do not consist of many different namespaces and heavily "oscillate" between parents and children from different namespaces. But I've no data to back up that claim.
Have a look at my example. There are millions of such ISO files out there.
For me, simply using parent['{/other/ns}child'] or getattr(parent, '{/other/ns}child') syntax just works. Not as beautiful as parent.child but not ugly, either.
It is not that I want a more convenient way to address the data. To address the data I use XPath. It is purely the fact that I cannot use the objectified data in a debugger while debugging that drives me mad. Software developers spend more than 80% of their time debugging and only 5-10% programming. So the emphasis of a good library should be on making it easy to debug.
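For inspecting a node while debugging, one possible stopgap (a sketch only, not the patch mentioned in this thread) is to list all children through the plain etree API, which does not filter by namespace. The helper name describe_children and the sample fragment are made up for illustration:

```python
from lxml import etree, objectify

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

def describe_children(element):
    """Return (namespace, local name, text) for every child element,
    regardless of which namespace the child lives in."""
    rows = []
    for child in element.iterchildren():
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        qname = etree.QName(child)
        rows.append((qname.namespace, qname.localname, child.text))
    return rows

# Made-up fragment: one child in a foreign namespace, one in the parent's.
xml = (
    '<gmd:title xmlns:gmd="%s" xmlns:gco="%s">'
    '<gco:CharacterString>Hello</gco:CharacterString>'
    '<gmd:other>x</gmd:other>'
    '</gmd:title>' % (GMD, GCO)
)
parent = objectify.fromstring(xml)
children = describe_children(parent)

for row in children:
    print(row)
```

Calling such a helper from a debugger's evaluate window shows the children that attribute-style access on the parent does not expose.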
Other namespace based python libs like for instance RDFlib solve this problem generically by adding the namespace to the python property. {http://www.isotc211.org/2005/gco}CharacterString -> gco_CharacterString
Well, but they're not adding the namespace but the ns-prefix. lxml uses Clark notation and qualified tag names everywhere, which is less error-prone and preferable in my experience (http://www.jclark.com/xml/xmlns.htm). No problems with namespace <--> prefix indirections, multiple prefixes for the same namespace, etc.
Note that to the best of my knowledge not every allowed prefix is a valid Python identifier (I think all XML name chars except ":" are allowed). So apart from tedious namespace <--> prefix indirection handling you'd also need to cater for rules to replace characters (e.g. <my-nsprefix:root xmlns:my-nsprefix="/my/ns"/> --> my_ns_root?). If you wanted to look up children using a <ns-prefix>_<unqualified name> attribute name syntax, that is.
Your argument is true: one will not in all cases be able to find a Python-compatible identifier name. But in my current case (dealing with geodata) nearly 100% of my data is not visible at all, so we are not talking about corner cases here. With my 10-line patch of lxml, 100% of my data is visible. My colleagues are already using the patched LXML module and are quite happy with it. As I wrote in another e-mail, I do not think that my patch is better than the current handling of namespaces: neither my patched version nor the current handling solves the problem completely. So at first there should be a choice for the users of lxml which flavour of namespace handling they prefer. And in addition I think it is time to go for a real solution of the namespace problem in LXML. LXML is the most advanced XML lib for Python and there is no real competitor. So it is a bit of a responsibility for LXML to deliver the best possible solution for the Python community :-). I have no idea how other programming languages solve the namespace -> object mapping. Perhaps we can learn from Java or Go or even JavaScript?
The problem lies deeply buried in the nature of the LXML objectify implementation. Objectify does not really transform the XML into a real Python instance hierarchy (as RDFlib does), but directs all attribute access via function calls to the C-libxml core. This is on the one hand desired behavior, since one can change the XML on the fly and some of the changes are visible both in the XML and in the objectified representation. But on the other hand the information about which namespace a node belongs to is not persistent in the node and therefore cannot be used for lookup.
To me this is a feature rather than a problem and it's been a design decision from day one (https://mail.python.org/archives/list/lxml@python.org/message/GOTPAWC4MHI5LV...). Some others seem to agree: https://zato.io/blog/posts/saving-time-with-a-pythonic-api-of-lxmlobjectify.....
But I'm obviously and deeply biased ;-)
That a software design decision dates from 2006 is not really a selling point for this design. Python wheels use email headers to describe their metadata. This design is from 2004 and it is IMHO not cool at all if I compare it to the package systems of other languages, which use XML, JSON or YAML. So let's not look back but forward to shape better software.
Volker
participants (2)
- Dr. Volker Jaenisch
- Holger.Joukl@LBBW.de