Mailman 3 Re: python lxml.objectify gives no attribute access to gco:CharacterString node - lxml - The Python XML Toolkit

4 Mar 2022

      ...
Absolutely! To be 100% sure to access the child you need the {URI}Name notation and there will never be a shortcut to that.
If we can agree on this, it is obvious that it is not a good idea to base your code on the Python properties exposed by lxml.objectify at all:
* These properties are not representing all of the xml data.
* There is a magical assumption, that a property is only visible if its namespace matches that of its parent. Sorry, but where is this supported in the RFCs of XML?
Where do the W3C XML Recommendations say anything about how an XML data handling library's API should
behave?
...
* Semantic changes can change the accessibility and visibility of properties.
* Property names do not even show their namespace or prefix.
I would never ever dare to base code on these brittle lxml.objectify python properties. Do you?
I don't consider them brittle at all. How attribute lookup works is well-defined. A semantic change (e.g. an element changes its
namespace, not merely a prefix) means you are dealing with a whole other element (its qualified name has changed).
This requires you to rewrite child-accessing code anyway, unless you want to access all elements, regardless of
semantic meaning. But then you could just as well use iterchildren() or iter().
...
OK, let us assume for a moment that there are silly people out there basing their code on a magical design decision from 2006 in
a particular Python lib called lxml that is based on no standard. So they are using lxml.objectify properties literally in their code, as in
"process(node.image)".
There was no "magical design decision". It was a deliberate design choice that elem.<attr> should look up the ns-unqualified tag name attr
in elem's own namespace, not in other namespaces.

But you can just as easily look up a qualified name using getattr(elem, '{/other/ns}attr') or elem['{/other/ns}attr'].
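To illustrate the lookup rule described above, here is a minimal sketch. The document, the namespace URIs (urn:example:my, urn:example:other) and the element name image are made up for illustration; only objectify.fromstring(), getattr() and item access come from lxml itself.

```python
from lxml import objectify

# Hypothetical document: the root and its first child share the default
# namespace, the second child lives in another namespace.
XML = (
    b'<root xmlns="urn:example:my" xmlns:o="urn:example:other">'
    b"<image>same.png</image>"
    b"<o:image>other.png</o:image>"
    b"</root>"
)

root = objectify.fromstring(XML)

# Plain attribute access resolves the name in the parent's own namespace.
same_ns = root.image

# A child from another namespace needs the fully qualified {URI}name form,
# either through getattr() or through item access.
other_via_getattr = getattr(root, "{urn:example:other}image")
other_via_index = root["{urn:example:other}image"]

print(same_ns.tag, same_ns.text)
print(other_via_getattr.tag, other_via_getattr.text)
print(other_via_index.tag, other_via_index.text)
```

Both qualified forms address the same element; root.image never returns the o:image child because its namespace differs from that of root.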
...
Then we have the obligation to help these people. A week ago I was one of them. I was not basing my code but my debugging on lxml.objectify - but all the same.
I like to make the debugging of lxml less harmful for people like me. With lxml.objectify2 the code of such poor people, only relying on a property name
(prone to semantic changes), can be supported with namespace prefixes, helping to gain more stability in closed contexts.
But what's the use case here: Iterate over elem.__dict__ and then getattr(elem, key) for every key?
Simply use elem.iterchildren() instead.
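As a rough sketch of that suggestion: iterchildren() yields every child regardless of namespace, so nothing is hidden the way it is when going through plain attribute names. The sample document and namespace URIs are invented for this example.

```python
from lxml import etree, objectify

# Hypothetical document with children in two different namespaces.
XML = (
    b'<root xmlns="urn:example:my" xmlns:o="urn:example:other">'
    b"<image>same.png</image>"
    b"<o:image>other.png</o:image>"
    b"</root>"
)

root = objectify.fromstring(XML)

# Collect the fully qualified tag of every child, whatever its namespace.
tags = [child.tag for child in root.iterchildren()]

for child in root.iterchildren():
    qname = etree.QName(child)
    # namespace URI, local name and the prefix used in this particular file
    print(qname.namespace, qname.localname, child.prefix)
```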

I understand the debugging inconvenience (and proposed a possible mitigation for it, though
I'm unsure if this should be actually done because of littering __dict__ and dir() with
names that are not valid Python identifiers, which is uncommon).
...
The mapping of ns-prefix <-> ns-URI is already present in lxml.objectify in any node.
So in case of an XML file with
xmlns html : https://www.w3.org/1999/xhtml
or an XML file with
xmlns s : https://www.w3.org/1999/xhtml
there is in any case a stable (but certainly local/temporary) mapping between the prefix html|s and the html namespace URI.
So the only thing that lxml.objectify2 does is mediate between different representations (Clark notation, <prefix>_<name>) of the same property.
So if the user gets a property "s_image" from an lxml.objectify2 entity, lxml.objectify2 can map this (for this particular XML file)
to {https://www.w3.org/1999/xhtml}image when talking to the etree API.
If the user is of the overly optimistic kind they can use "s_image" literally in their code. This will fail in some cases (depending on the namespace
mapping and context). But it will also fail if they use "image" (standard lxml) in cases of changed semantics.
If we start a discussion here about whether changes in the namespace prefixes are more likely to happen than semantic changes,
the whole world will laugh at us. So I think we can agree that lxml should be resilient against both types of change.
So then what's the point of using elem.s_image in my code when this may suddenly cease to work if I could simply
use getattr(elem, '{https://www.w3.org/1999/xhtml}attr') instead and be safe?

IMHO you can't be resilient against those changes: if the meaning of a variable "color" (in XML terms: "{/my/colors}color")
changes you'll have to change your code. Because color now suddenly contains sounds, not colors :-).
And if its name changes to "my_shiny_color" (in XML terms "{/my/shiny/colors}color") you'll have to change
your code, because now the name "color" is undefined.
...
To make lxml.objectify2 perfect I can add the option for a user to pass a prefix-namespace mapping to lxml.objectify2. With this mapping any code can define stable prefixes to work with while being independent of the namespace prefixes of a given file. This is the same notion
as, for instance, node.xpath(namespaces={}) in lxml.
Which you'd have to hand in when parsing? Because you can't hand it in for elem.attr syntax or getattr(elem, 'attr'),
unless you want to override Python's built-in getattr.
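For comparison, this is roughly how the existing xpath(namespaces=...) mechanism mentioned above lets calling code pick its own prefix, independent of the prefix used in the file. The document, the URI urn:example:other and the prefix names are assumptions made for this sketch.

```python
from lxml import objectify

# Hypothetical file that binds the namespace to the prefix "s".
XML = (
    b'<root xmlns:s="urn:example:other">'
    b"<s:image>logo.png</s:image>"
    b"</root>"
)

root = objectify.fromstring(XML)

# The calling code chooses its own prefix ("html") for the same URI;
# the prefix used inside the file does not matter to the XPath query.
NSMAP = {"html": "urn:example:other"}
hits = root.xpath("html:image", namespaces=NSMAP)

print([hit.text for hit in hits])
```

The mapping is an explicit argument of the xpath() call, which is exactly what plain elem.attr or getattr(elem, 'attr') has no place for.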
...
To conclude my proposal:
lxml.objectify2 is better:
* since it is an addition that changes nothing in the current lxml.objectify
* since it shows (__dict__) all sub_nodes (lxml.objectify does not)
* since it shows also the namespace prefixes (lxml.objectify does not)
* since it allows for more possibilities to access/display a node
   unqualified property name -> 'image' [unstable]
   prefix qualified property name -> 'html_image' [locally or in certain contexts stable]
   fully qualified property name -> '{https://www.w3.org/1999/xhtml}image' [globally stable]
Do you propose another way to access with a fully qualified property name apart from getattr() / indexed access?
As it stands, this is already available and your other 2 methods are unstable (so unusable, in my book).
And I fail to imagine something like elem.https_www_w3_org_1999_xhtml_image as a desirable API.

So to me, the proposal doesn’t add substantial gain apart from the debugging visibility but rather
adds ambiguity.
Maybe the debugging inconvenience could be addressed in lxml.objectify, as I mentioned. Maybe
you could simply implement it in pure Python on top of objectify, as Stefan suggested.
Or maybe teach your IDE to better support your debugging needs.
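A pure-Python debugging aid on top of objectify could look roughly like the following sketch. The helper name debug_children and the <prefix>_<name> key format are hypothetical, chosen only to mirror the naming discussed in this thread; the keys depend on the prefixes of the file at hand, so they are meant for inspection and not as stable identifiers in code.

```python
from lxml import etree, objectify


def debug_children(elem):
    """Group all child elements under a display key (hypothetical helper).

    The key is "<prefix>_<localname>" when the child's tag carries a prefix
    in this file, otherwise just "<localname>". For inspection only.
    """
    result = {}
    for child in elem.iterchildren("*"):  # "*" restricts the iteration to elements
        qname = etree.QName(child)
        if child.prefix:
            key = f"{child.prefix}_{qname.localname}"
        else:
            key = qname.localname
        result.setdefault(key, []).append(child)
    return result


# Hypothetical document with children in two different namespaces.
XML = (
    b'<root xmlns="urn:example:my" xmlns:s="urn:example:other">'
    b"<image>a.png</image>"
    b"<s:image>b.png</s:image>"
    b"</root>"
)

root = objectify.fromstring(XML)
overview = debug_children(root)

for key, elements in sorted(overview.items()):
    print(key, [element.tag for element in elements])
```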

One last thing I'd respectfully ask of you:
...
I would never ever dare to base code on these brittle lxml.objectify python properties. Do you?
[...] silly people out there basing their code on a magical design decision from 2006 in a particular python lib called lxml,
that is based on no standard.
Then we have the obligation to help these people
I like to make the debugging of lxml less harmful for people like me. With lxml.objectify2 the code of such poor people, only relying on a property name
If we start a discussion here if changes in the namespace prefixes are more likely to happen than semantic changes, the whole world will laugh at us
Please tone down your wording ("brittle lxml.objectify properties", "silly people basing their code on a magical design decision from 2006",
"based on no standard", "obligation to help these people", "make the debugging of lxml less harmful", "poor people",
"the whole world will laugh at us").
It doesn't help.

Holger


Holger.Joukl＠LBBW.de

tags

participants (1)