Re: python lxml.objectify gives no attribute access to gco:CharacterString node
Hi, Stefan wrote:
Note that the content of the XML file that your code is designed to process did not change at all. It's just that some entirely unrelated content was added, in a completely different and unrelated namespace. And it was just externally added to the input data, or maybe just to some tiny portion of it, without telling you or your code about it. Especially in places with optional content, where different namespaces are already a little more common than elsewhere, this is fairly likely to go unnoticed.
I find this kind of behaviour dangerous enough to restrict the "magic" in the API to what is easy to understand and predict.
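A minimal sketch of the scenario described above, assuming lxml is installed. The element names and the namespace URIs "A" and "B" are made up for illustration. It shows that adding an element from an unrelated namespace does not change what plain attribute access returns:

```python
from lxml import objectify

# Original document: one child, in the same namespace as its parent.
before = objectify.fromstring(
    '<a:root xmlns:a="A"><a:x>1</a:x></a:root>'
)

# The same document after an unrelated <b:x> was added in front of <a:x>.
after = objectify.fromstring(
    '<a:root xmlns:a="A" xmlns:b="B"><b:x>99</b:x><a:x>1</a:x></a:root>'
)

# Attribute access only considers children in the parent's namespace,
# so the foreign element does not change what root.x refers to.
print(before.x, after.x)
print(after.x.tag)
```

Code written against the first document keeps reading the same element from the second one.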
Any magic namespace prefix-based lookup scheme can be dangerous in a similar vein IMHO: E.g.
    >>> root = objectify.fromstring("""
    ... <a:root xmlns:a="A" xmlns:b="B">
    ...   <a:x>1</a:x>
    ...   <b:x>2</b:x>
    ...   <x>3</x>
    ... </a:root>""")
    >>> root.b_x  # fictitious ns-prefix-based lookup
    2
If you now change one XML doc namespace prefix from xmlns:b to xmlns:ns_b:
    >>> root = objectify.fromstring("""
    ... <a:root xmlns:a="A" xmlns:ns_b="B">
    ...   <a:x>1</a:x>
    ...   <ns_b:x>2</ns_b:x>
    ...   <x>3</x>
    ... </a:root>""")
    >>> root.b_x  # fictitious ns-prefix-based lookup
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "src/lxml/objectify.pyx", line 231, in lxml.objectify.ObjectifiedElement.__getattr__
      File "src/lxml/objectify.pyx", line 450, in lxml.objectify._lookupChildOrRaise
    AttributeError: no such child: b_x
Again, the very same code would suddenly cease to work, while the XML document remains semantically identical. You'd get an exception in the best case, or silently ignore data in the worst case. That aside: Volker wrote:
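For contrast, a small sketch (assuming lxml is installed) of the lookups objectify offers for children in another namespace. They are keyed on the namespace URI in `{URI}name` form, so they give the same result for both prefix spellings. The constant and function names are only illustrative:

```python
from lxml import objectify

DOC_PREFIX_B = """<a:root xmlns:a="A" xmlns:b="B">
  <a:x>1</a:x><b:x>2</b:x><x>3</x>
</a:root>"""

DOC_PREFIX_NS_B = """<a:root xmlns:a="A" xmlns:ns_b="B">
  <a:x>1</a:x><ns_b:x>2</ns_b:x><x>3</x>
</a:root>"""


def b_x(xml):
    root = objectify.fromstring(xml)
    # Both spellings address the child by namespace URI, not by prefix.
    via_getattr = getattr(root, "{B}x")
    via_item = root["{B}x"]
    return int(via_getattr), int(via_item)


for xml in (DOC_PREFIX_B, DOC_PREFIX_NS_B):
    print(b_x(xml))
```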
[...] Debugging becomes a great hassle if you are not able, e.g. in your PyCharm IDE, to navigate the XML tree your parser is currently processing. Even worse if some nodes do not even seem to exist. [...] It is not that I would like a more convenient way to address the data. To address the data I use xpath. It is purely the fact that I cannot use the objectified data in a debugger while debugging that drives me mad.
I admit I don't fully understand the issue (I don't use PyCharm and don't know how it presents objects in debugging). To me, it seems easy enough to just do something like
    >>> list(root.iterchildren())
    [1, 2, 3]
or
    >>> print(objectify.dump(root))  # see also objectify.enable_recursive_str()
    {A}root = None [ObjectifiedElement]
        {A}x = 1 [IntElement]
        {B}x = 2 [IntElement]
        x = 3 [IntElement]
Does PyCharm use elem.__dict__ or dir(elem) to present an object's attributes in debugging? Then maybe a way to address OP's issue might be to populate elem.__dict__ not only with element children from the same namespace but with all children, while *still* only allowing attribute lookup of children from elem's namespace. I.e. instead of
    >>> root = objectify.fromstring("""
    ... <a:root xmlns:a="A">
    ...   <a:x>1</a:x>
    ...   <x>3</x>
    ... </a:root>""")
    >>> root.__dict__
    {'x': 1}
__dict__ would yield
    >>> root.__dict__  # not how it works today!
    {'{A}x': 1, '{}x': 3}
...making all children appear in e.g. dir(), keeping existing getattr behavior:
    >>> root.x
    1
Maybe this would lessen the "child visibility issue" in debugging? It would be a breaking change, of course, and IMO it would make __dict__ usage more surprising and arguably more "non-standard" compared to regular Python objects, since it would contain names that are not valid Python identifiers.

From a cursory glance over the implementation, this should be possible in theory. But I'm not really convinced we should do this. Maybe the debugger/IDE can just be taught to give more helpful output? All the information is there in the first place...

Holger
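The "show all children" view discussed above can also be approximated from outside lxml while debugging. The following is only a sketch, assuming lxml is installed. `debug_children` is a hypothetical helper name, not part of lxml, and it keys children by their full `{URI}name` tag rather than changing `__dict__`:

```python
from lxml import objectify


def debug_children(elem):
    """Return {qualified tag: [text, ...]} for all direct child elements."""
    children = {}
    # "*" restricts the iteration to elements (no comments or PIs).
    for child in elem.iterchildren("*"):
        children.setdefault(child.tag, []).append(child.text)
    return children


root = objectify.fromstring(
    '<a:root xmlns:a="A"><a:x>1</a:x><x>3</x></a:root>'
)

# What objectify exposes today: only children in the parent's namespace.
print(list(root.__dict__))

# What the helper shows: every child, keyed by namespace-qualified tag.
print(debug_children(root))
```

Evaluating `debug_children(node)` in a debugger's expression window gives the full picture without touching the attribute lookup rules.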
Dear Holger! On 03.03.22 at 15:38, Holger.Joukl@LBBW.de wrote:
I admit I don’t fully understand the issue (I don't use PyCharm and don't know how it presents objects in debugging). To me, it seems easy enough to just do s.th. like
In PyCharm (and I think in any reasonable IDE these days) you have a full representation of every Python entity at any call-stack level in the debugger: https://backend.datenadler.de/kram/bildschirmfoto-vom-2022-03-03-17-39-44.pn... In this example you see (part of) the representation of an ISO19139 file in the PyCharm debugger. If you compare the source XML file https://backend.datenadler.de/kram/31591bca-bb40-4d8a-98ad-35efc37524c9.xml/... with this image, you will notice that no gco:CharacterString elements are shown, because they are not in __dict__/dir().
Does PyCharm use elem.__dict__ or dir(elem) to present an object's attributes in debugging?
Exactly.

To make this long discussion short: as I wrote earlier, neither the current state of LXML nor my hackish version is really good. One real solution may be to include the namespace (represented via its prefix) in the property/entity name.

Therefore I am currently working on enabling LXML to have <prefix>_<name> properties in objectify. The changes are not too complicated since the source code quality is good. I am hopeful that after the weekend I will have a fully functional prototype. The __dict__ is already working, but the class representation still has some problems to be solved.

Cheers,
Volker

--
inqbus Scientific Computing
Dr. Volker Jaenisch
Hungerbichlweg 3
86977 Burggen
+49 (8860) 9222 7 92
https://inqbus.de
Dr. Volker Jaenisch wrote on 03.03.22 at 18:19:
Therefore I am currently working on enabling LXML to have <prefix>_<name> properties in objectify. The changes are not too complicated since the source code quality is good. I am hopeful that after the weekend I will have full functional prototype.
As Holger wrote, the issue with prefixes is that they are provided by the input document. There are well-known prefixes for a handful of namespaces, but that is a pure naming convention and in no way an obligation.

While I can see that it might be helpful for debugging purposes to see that there are attributes like "html_image", no-one keeps them from ending up as "s_image" or just "image" (with a default namespace and no prefix), if the creator of the specific document at hand decides so.

Aside from debugging, I fail to see a use case for this. And it increases the risk for innocent users to write code that seems to work with most documents (that use "standard" prefixes) but fails for others (which tend to be missing from the test suite).

So … I think keeping prefixes generally out of the interface is a good decision.

Stefan
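To make the point about document-chosen prefixes concrete, here is a small sketch, assuming lxml is installed. The three documents are invented for illustration and reuse the "image" element name from the discussion. They are equivalent as far as namespaces go, yet the prefix seen on the element differs each time:

```python
from lxml import etree

XHTML = "http://www.w3.org/1999/xhtml"

docs = [
    f'<html:body xmlns:html="{XHTML}"><html:image/></html:body>',
    f'<s:body xmlns:s="{XHTML}"><s:image/></s:body>',
    f'<body xmlns="{XHTML}"><image/></body>',  # default namespace, no prefix
]

prefixes = []
tags = []
for xml in docs:
    image = etree.fromstring(xml)[0]
    qname = etree.QName(image)
    prefixes.append(image.prefix)  # chosen by the document author
    tags.append(image.tag)         # {URI}localname, identical in all three
    print(repr(image.prefix), qname.namespace, qname.localname)
```

Only the qualified tag is the same across all three inputs. A name built from the prefix would be "html_image", "s_image" or plain "image" depending on the file.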
Dear Stefan! On 03.03.22 at 20:05, Stefan Behnel wrote:
So … I think keeping prefixes generally out of the interface is a good decision.
I share your concerns. That is why I never even thought of changing the behavior of LXML, or even that of lxml.objectify. I will come up with lxml.objectify2, which does not change anything in lxml (or objectify) and is a pure addition. The user can decide to use lxml.objectify2 as tree_class when parsing.
While I can see that it might be helpful for debugging purposes to see that there are attributes like "html_image", no-one keeps them from ending up as "s_image" or just "image" (with a default namespace and no prefix), if the creator of the specific document at hand decides so.

Absolutely! To be 100% sure to access the child you need the {URI}Name notation and there will never be a shortcut to that.
If we can agree on this, it is obvious that it is not a good idea to base your code on the Python properties exposed by lxml.objectify at all:

* These properties do not represent all of the XML data.
* There is a magical assumption that a property is only visible if its namespace matches that of its parent. Sorry, but where is this supported in the XML specifications?
* Semantic changes can change the accessibility/visibility of properties.
* Property names do not even show their namespace or prefix.

I would never ever dare to base code on these brittle lxml.objectify Python properties. Do you?

OK, let us assume for a moment that there are silly people out there basing their code on a magical design decision from 2006 in a particular Python lib called lxml, a decision that is based on no standard. So they are using lxml.objectify properties literally in their code, as in "process(node.image)". Then we have the obligation to help these people.

A week ago I was one of them. I was basing not my code but my debugging on lxml.objectify, but it comes to the same thing. I would like to make the debugging of lxml less harmful for people like me.

With lxml.objectify2, the code of such poor people, who rely only on a property name (prone to semantic changes), can be supported with namespace prefixes, helping to gain more stability in closed contexts.
The mapping of ns-prefix <-> ns-URI is already present in lxml.objectify in any node.

So in the case of an XML file with

    xmlns:html="http://www.w3.org/1999/xhtml"

or an XML file with

    xmlns:s="http://www.w3.org/1999/xhtml"

there is in any case a stable (but certainly local/temporary) mapping between the prefix html|s and the HTML namespace URI.

So the only thing that lxml.objectify2 does is mediate between different representations (Clark notation, <prefix>_<name>) of the same property.

So if the user gets a property from an lxml.objectify2 entity that is "s_image", lxml.objectify2 can map this (for this particular XML file) to {http://www.w3.org/1999/xhtml}image when talking to the etree API.

If the user is of the overly optimistic kind, they can use "s_image" literally in their code. This will fail in some cases (depending on the namespace mapping and context). But it will also fail if they use "image" (standard lxml) in cases of changed semantics.

If we start a discussion here about whether changes in the namespace prefixes are more likely to happen than semantic changes, the whole world will laugh at us. So I think we can agree that LXML should be resilient against both types of change.

To make lxml.objectify2 perfect, I can add the option for a user to pass a prefix-namespace mapping to lxml.objectify2. With this mapping any code can define stable prefixes to work with while being independent of the namespace prefixes of a given file. This is the same notion as, for instance, node.xpath(namespaces={}) in lxml.
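The caller-supplied mapping idea can be sketched on top of today's lxml.objectify. This is not the proposed lxml.objectify2; `get_child` and `MY_PREFIXES` are hypothetical names used only for illustration, and the sketch assumes lxml is installed. It translates a `<prefix>_<name>` string into `{URI}name` using a mapping owned by the calling code, in the same spirit as the `namespaces=` argument of `xpath()`:

```python
from lxml import objectify

XHTML = "http://www.w3.org/1999/xhtml"

# Chosen by the calling code, independent of the prefixes used in any file.
MY_PREFIXES = {"html": XHTML}


def get_child(elem, name, prefixes):
    """Resolve 'prefix_local' through a caller-supplied prefix map.

    Splits on the first underscore, so prefixes containing "_" are not
    supported in this sketch.
    """
    prefix, sep, local = name.partition("_")
    if not sep or prefix not in prefixes:
        raise AttributeError(name)
    return getattr(elem, "{%s}%s" % (prefixes[prefix], local))


# This file uses the prefix "s", but the code keeps using "html".
doc = objectify.fromstring(
    f'<s:body xmlns:s="{XHTML}"><s:image>logo.png</s:image></s:body>'
)

print(get_child(doc, "html_image", MY_PREFIXES))
print(doc.xpath("html:image", namespaces=MY_PREFIXES)[0])
```

The split on the first underscore already hints at an ambiguity of the `<prefix>_<name>` scheme: both prefixes and local names may themselves contain underscores.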
To conclude my proposal:

lxml.objectify2 is better:

* since it is an addition that changes nothing in the current lxml/objectify
* since it shows (__dict__) all sub-nodes (lxml.objectify does not)
* since it also shows the namespace prefixes (lxml.objectify does not)
* since it allows for more possibilities to access/display a node:
    unqualified property name -> 'image' [unstable]
    prefix-qualified property name -> 'html_image' [stable locally or in certain contexts]
    fully qualified property name -> '{http://www.w3.org/1999/xhtml}image' [globally stable]

lxml.objectify2 is worse:

<your comment>

Cheers,
Volker
Hi Volker,

this reads like something you could implement on top of lxml.objectify, via subclassing and an appropriate element class lookup.

This could really be a plain Python package that you could distribute on PyPI to give users an easy choice which interface they prefer. Not everything needs to be part of lxml itself.

Stefan
Dear Stefan! On 03.03.22 at 23:54, Stefan Behnel wrote:
Hi Volker,
this reads like something you could implement on top of lxml.objectify, via subclassing and an appropriate element class lookup.
This could really be a plain Python package that you could distribute on PyPI to give users an easy choice which interface they prefer. Not everything needs to be part of lxml itself.
My prototype is still glued to lxml, since I use internal Cython functions of lxml that are not exported to Python space. But with a little help from the kind lxml people it may be possible to separate it from lxml completely.

Cheers,
Volker
Dr. Volker Jaenisch wrote on 04.03.22 at 00:02:
On 03.03.22 at 23:54, Stefan Behnel wrote:
this reads like something you could implement on top of lxml.objectify, via subclassing and an appropriate element class lookup.
This could really be a plain Python package that you could distribute on PyPI to give users an easy choice which interface they prefer. Not everything needs to be part of lxml itself.
My prototype is still glued to lxml, since I use internal Cython functions of lxml that are not exported to Python space. But with a little help from the kind lxml people it may be possible to separate it from lxml completely.
The idea is to do pretty much what objectify currently does, using (I guess) the same element lookup, but to use a Python subclass of the ObjectifiedElement class for the tree structure that implements your different attribute lookup scheme in "__getattr__".

The general mechanism for selecting element class implementations is described here: https://lxml.de/element_classes.html

Stefan
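A rough sketch of what that suggestion could look like, assuming lxml is installed. `PrefixAwareElement` is a hypothetical class written for this illustration, not Volker's prototype and not part of lxml. It keeps objectify's normal lookup and only adds a `<prefix>_<name>` fallback based on the prefixes declared in the document, so it inherits the fragility discussed earlier in the thread:

```python
from lxml import etree, objectify


class PrefixAwareElement(objectify.ObjectifiedElement):
    """Hypothetical tree class adding a <prefix>_<name> child lookup."""

    def __getattr__(self, name):
        # Python calls this only after the regular attribute lookup
        # (including objectify's same-namespace child lookup) has failed.
        prefix, sep, local = name.partition("_")
        uri = self.nsmap.get(prefix) if sep and prefix else None
        if uri is None:
            raise AttributeError(name)
        child = self.find("{%s}%s" % (uri, local))
        if child is None:
            raise AttributeError(name)
        return child


# Use the subclass for elements that have children; leaves keep the
# usual objectify data classes (IntElement, StringElement, ...).
lookup = objectify.ObjectifyElementClassLookup(tree_class=PrefixAwareElement)
parser = etree.XMLParser(remove_blank_text=True)
parser.set_element_class_lookup(lookup)

root = objectify.fromstring(
    '<a:root xmlns:a="A" xmlns:b="B"><a:x>1</a:x><b:x>2</b:x></a:root>',
    parser,
)

print(type(root).__name__)
print(root.x)    # regular objectify lookup in the parent's namespace
print(root.b_x)  # prefix-based fallback, valid for this document only
```

Because everything lives in a subclass and a parser configuration, a package built this way would not need any change inside lxml.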
participants (3)
- Dr. Volker Jaenisch
- Holger.Joukl@LBBW.de
- Stefan Behnel