Compatibility issues between `lxml.etree.set_element_class_lookup` and `lxml.html`
Hello,
Sorry for the bother, but I've been looking at `lxml.etree.set_element_class_lookup`[0] as a way to add validation and features to lxml usage without having to ban "standard" lxml constructs (and to control usage by dependencies as well). However, while it seems to work fine for etree itself, it interacts quite badly with both lxml.html and objectify:
>>> from lxml import etree, html, objectify
>>> class Custom(etree.ElementBase):
...     ...
...
>>> etree.set_element_class_lookup(etree.ElementDefaultClassLookup(element=Custom))
>>> print(type(html.fromstring('<a>')))
<class '__main__.Custom'>
>>> print(type(objectify.fromstring('<a/>')))
<class '__main__.Custom'>
As can be seen here, both the html parser and the objectify parser seem to completely lose their "magic". Is there a "proper" way to make these things collaborate? I looked at lxml.html and it looked like it might have to be rebuilt from the `HtmlMixin` (which already seems icky), but `objectify` is a Cython module so there doesn't seem to be a good way to interact with it.
[0] https://lxml.de/3.1/api/public/lxml.etree-module.html#set_element_class_look...
Hi Xavier,
On 07.03.2022 at 13:27, Xavier Morel wrote:
> As can be seen here, both the html parser and the objectify parser seem to completely lose their "magic".
> Is there a "proper" way to make these things collaborate?
I've recently wondered the same and started experimenting with how to combine custom class lookup and lxml.objectify, see this gist [0]. It doesn't use ElementDefaultClassLookup but ElementNamespaceClassLookup, but maybe it's still relevant.
The essence of making it work was:
- set objectify.ObjectifyElementClassLookup() as fallback in our custom lookup scheme, so that tree nodes get "objectified" (L. 56)
- create new elements via a custom parser, because only the parser knows about the custom lookup scheme (L. 94, 98)
I'm an lxml newbie, so maybe some more experienced folks here can review the code and tell whether it's a "proper way"? I would appreciate some feedback.
Cheers
Tobias
[0] https://gist.github.com/haxtibal/8bfc6d32915c7c434e1bc91d016de916
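The two points above could be sketched roughly as follows. This is not the code from the gist: the namespace URI, the `Entry` class and its `describe` method are hypothetical names chosen only for illustration.

```python
from lxml import etree, objectify

NS = "http://example.com/ns"  # hypothetical namespace URI


class Entry(objectify.ObjectifiedElement):
    """Hypothetical custom class for {NS}entry elements."""

    def describe(self):
        return "entry " + self.get("id")


# Custom lookup scheme with the objectify lookup as fallback, so that
# tags without a registered class still become objectify elements.
lookup = etree.ElementNamespaceClassLookup(
    objectify.ObjectifyElementClassLookup()
)
lookup.get_namespace(NS)["entry"] = Entry

# Only this parser instance knows about the custom lookup scheme.
parser = etree.XMLParser(remove_blank_text=True)
parser.set_element_class_lookup(lookup)

root = objectify.fromstring(
    '<root xmlns="%s"><entry id="1"/><amount>5</amount></root>' % NS,
    parser,
)

# New elements are created through the same parser so that they
# go through the same lookup.
new_entry = parser.makeelement("{%s}entry" % NS, id="2")
```

Here `root.entry` should come back as an `Entry`, while `root` and `root.amount` are still handled by the objectify fallback.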
On 3/7/22 22:29, Tobias Deiminger wrote:
> On 07.03.2022 at 13:27, Xavier Morel wrote:
>> As can be seen here, both the html parser and the objectify parser seem to completely lose their "magic".
>> Is there a "proper" way to make these things collaborate?
> I've recently wondered the same and started experimenting with how to combine custom class lookup and lxml.objectify, see this gist [0]. It doesn't use ElementDefaultClassLookup but ElementNamespaceClassLookup, but maybe it's still relevant.
> The essence of making it work was:
> - set objectify.ObjectifyElementClassLookup() as fallback in our custom lookup scheme, so that tree nodes get "objectified" (L. 56)
> - create new elements via a custom parser, because only the parser knows about the custom lookup scheme (L. 94, 98)
> I'm an lxml newbie, so maybe some more experienced folks here can review the code and tell whether it's a "proper way"? I would appreciate some feedback.
Interesting take. I think it doesn't work for my case because neither ElementDefaultClassLookup nor ObjectifyElementClassLookup supports a fallback (which makes sense, as they always match everything), and even if they did, `ObjectifyElementClassLookup` would still return `lxml.objectify.ObjectifiedElement` classes built upon the original `lxml.etree.ElementBase` rather than upon the classes I want as new bases.
I guess the way forward for me is to re-create the relevant subclasses (`lxml.html` has a mixin; `lxml.objectify` doesn't, but getting the inheritance diamond right should be feasible), then use the ElementClassLookup objects provided by each submodule to update or replace the default parsers. It's a bit fiddly, and I don't know whether all utility functions are cooperative (e.g. lxml.objectify.makeparser), but it seems like a workable plan, at least unless a maintainer or experienced user knows about issues with that idea.
Thanks.
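For the `lxml.html` half of that plan, a minimal sketch might look like the following. It assumes a hypothetical `Custom` base class with a hypothetical `tag_upper` method, combines it with `lxml.html.HtmlMixin`, and uses a plain `ElementDefaultClassLookup` on a dedicated parser. The specialised classes that `lxml.html`'s own lookup returns for tags such as forms and inputs are not reproduced here, and the objectify half is not covered.

```python
from lxml import etree, html


class Custom(etree.ElementBase):
    """Hypothetical base class carrying the extra features."""

    def tag_upper(self):
        return self.tag.upper()


class CustomHtmlElement(html.HtmlMixin, Custom):
    """HTML helper methods from the mixin, layered on the custom base."""


# Configure the lookup on a dedicated HTML parser instead of globally.
parser = etree.HTMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=CustomHtmlElement)
)

link = html.fromstring('<a href="/x">link text</a>', parser=parser)
```

With this, `link` should offer both the mixin's helpers (for example `text_content()`) and the methods defined on `Custom`.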
Hi,
Xavier Morel wrote on 07.03.22 at 13:27:
> Sorry for the bother, but I've been looking at `lxml.etree.set_element_class_lookup`[0] as a way to add validation and features to lxml usage without having to ban "standard" lxml constructs (and to control usage by dependencies as well).
I consider the function fine for what it does, but if it gets in the way, don't use it. It's a global setting, which means that it can break stuff elsewhere, unintentionally and without warning. Just create your own parser instance and configure the class lookup only there.
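A minimal sketch of that suggestion, assuming a placeholder `Custom` class: the lookup is set on one parser instance only, so `lxml.html` and `lxml.objectify` keep using their own default parsers and classes.

```python
from lxml import etree, html, objectify


class Custom(etree.ElementBase):
    """Placeholder for a class adding validation or features."""


# Configure the class lookup on this parser only, not globally.
parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=Custom)
)

custom_root = etree.fromstring("<a/>", parser)

# The submodules are unaffected because nothing global was changed.
html_root = html.fromstring("<a>")
objectified_root = objectify.fromstring("<a/>")
```

The trade-off is that every call site has to pass the parser explicitly.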
> Is there a "proper" way to make these things collaborate? I looked at lxml.html and it looked like it might have to be rebuilt from the `HtmlMixin` (which already seems icky), but `objectify` is a Cython module so there doesn't seem to be a good way to interact with it.
Cython modules are mostly just compiled Python modules and behave pretty much the same, from a user perspective. If you can read Python, you can probably read Cython code, and if you know how to use Python modules, you can probably also work with Cython compiled modules.
Stefan
On 3/8/22 12:02, Stefan Behnel wrote:
> Xavier Morel wrote on 07.03.22 at 13:27:
>> Sorry for the bother, but I've been looking at `lxml.etree.set_element_class_lookup`[0] as a way to add validation and features to lxml usage without having to ban "standard" lxml constructs (and to control usage by dependencies as well).
> I consider the function fine for what it does, but if it gets in the way, don't use it. It's a global setting, which means that it can break stuff elsewhere, unintentionally and without warning.
> Just create your own parser instance and configure the class lookup only there.
The bother was mostly about things such as dependencies, where a per-parser configuration can be easy to miss, or possibly impossible to apply everywhere. `etree.set_default_parser` also exists, but I think it raises the same concern when it comes to the interaction with the submodules built on top of ElementBase.
>> Is there a "proper" way to make these things collaborate? I looked at lxml.html and it looked like it might have to be rebuilt from the `HtmlMixin` (which already seems icky), but `objectify` is a Cython module so there doesn't seem to be a good way to interact with it.
> Cython modules are mostly just compiled Python modules and behave pretty much the same, from a user perspective. If you can read Python, you can probably read Cython code, and if you know how to use Python modules, you can probably also work with Cython compiled modules.
Oh yeah, reading the code and understanding the "normal" API is not an issue; the code is quite readable and well documented. The issue is mostly how to do the weirder things. For example, if I plug in extensions via set_element_class_lookup and a dependency has decided to use objectify for some reason, then it might break entirely because `objectify.fromstring` now returns the custom element; but if I use `etree.set_default_parser`, then the objectified elements would not have the extensions, right? And, importantly, how to do these weirder things at least somewhat cleanly from lxml's perspective, e.g. without relying too much on unsupported implementation details.
My understanding so far (see other subthread) is that I'd hook into `etree` itself via either `set_element_class_lookup` or (if that doesn't work) `set_default_parser`, re-do the objectify classes (inherit from the right objects in the right order) and set them via `lxml.objectify.set_default_parser`. But `lxml.html` doesn't have such a function; would directly setting `lxml.html.html_parser` be the right thing to do then? They are listed in the API doc, but strictly speaking they're not documented as updateable default parser instances.
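The objectify part of that plan could be sketched as below, under the assumption that customising only the tree class is enough. `CustomTree` and `child_tags` are hypothetical names; the data classes (such as `IntElement`) are not changed by the `tree_class` argument, and the `lxml.html.html_parser` question is left open here.

```python
from lxml import etree, objectify


class CustomTree(objectify.ObjectifiedElement):
    """Hypothetical replacement for objectify's default tree class."""

    def child_tags(self):
        return [child.tag for child in self.iterchildren()]


parser = etree.XMLParser(remove_blank_text=True)
parser.set_element_class_lookup(
    objectify.ObjectifyElementClassLookup(tree_class=CustomTree)
)

# Replace the module-wide default parser used by objectify.fromstring().
objectify.set_default_parser(parser)
try:
    root = objectify.fromstring("<a><b>1</b><c>x</c></a>")
finally:
    # Calling it without arguments restores the original default parser.
    objectify.set_default_parser()

restored = objectify.fromstring("<a><b>1</b></a>")
```

Like `set_element_class_lookup`, this is a process-wide setting, so the same caveat about surprising other users of the module applies.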
participants (3)
- Stefan Behnel
- Tobias Deiminger
- Xavier Morel