[lxml] Re: HtmlMixin for lxml.etree._Element

26 May 2021

Hi!

Vostretsov Nikita wrote on 26.05.21 at 11:35:
...
Let me describe a situation.
We have a lot of code working with lxml.html.HtmlElement. Now we want to support HTML5. html5lib is too slow for our requirements. Other libraries are implemented in C for better performance.
E.g.:
- https://github.com/kovidgoyal/html5-parser/blob/master/src/as-libxml.c (gumbo)
- http://source.netsurf-browser.org/libhubbub.git/tree/examples/libxml.c (libhubbub)
- https://github.com/SimonSapin/html5ever-python/blob/master/html5ever/element... (html5ever)
Converting (https://github.com/whalebot-helmsman/html5-parser/blob/lxml-html/src/html5_p...) the Python structures by copying attributes, text and tail from lxml.etree._Element to lxml.html.HtmlElement is also slower than our current HTML4 code (~20%).
As I understand it, there is no difference between HTML and XML at the C level. lxml.html.HtmlElement is a Python structure.
Is it possible to have lxml.html.HtmlElement on top of lxml.etree._Element without copying (and the performance drop that comes with it)?
Maybe I am missing another possibility?

lxml.html simply registers the HtmlElement class with the parser, which makes
the parser return instances of it as the Python representation of the
elements. It's explained here:

https://lxml.de/element_classes.html
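
As a minimal sketch of that registration, assuming a plain etree.XMLParser
stands in for whatever parser builds the tree (the sample markup is made up
for illustration):

```python
from lxml import etree, html

# Register lxml.html's element class lookup on an ordinary etree parser.
parser = etree.XMLParser()
parser.set_element_class_lookup(html.HtmlElementClassLookup())

# Hypothetical sample input, only here to exercise the lookup.
doc = etree.fromstring("<div><form action='/go'/></div>", parser)

# The C-level tree is unchanged; only the Python proxy classes differ.
print(type(doc).__name__)     # class chosen for <div>
print(type(doc[0]).__name__)  # class chosen for <form>
```

No data is copied here: the lookup only decides which Python class wraps
each libxml2 node when a proxy object is created for it.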

The parsers that you linked above may not support this directly (although
there is no reason why they couldn't), but as soon as you have an lxml tree
in your hands, you can replace the element class lookup that it uses. That
way, you can keep the tree but change its Python interface.

Note that you need to take special care of the (root) element returned by
the parser. As long as you have a Python reference to it, it won't change
its Python class interface. You may be able to work around that by

1) changing the element lookup scheme
2) getting hold of some child element reference (should use the new class)
3) deleting the reference to the root element (make sure it's the last one)
4) using something like child.getroottree() or .getparent() to get a new
reference to the root node
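
The four steps above could look roughly like this. It is only a sketch under
some assumptions: the tree comes from an etree parser object that is still
at hand (so its lookup can be changed after parsing), CPython's reference
counting frees the old root proxy as soon as the last reference is dropped,
and the root has at least one child. The helper name parse_then_switch is
made up for illustration.

```python
from lxml import etree, html


def parse_then_switch(markup):
    """Parse with plain etree classes, then switch the tree to HtmlElement."""
    parser = etree.XMLParser()
    root = etree.fromstring(markup, parser)
    # At this point the root proxy is a plain etree._Element.
    assert not isinstance(root, html.HtmlElement)

    # 1) change the element lookup scheme used for this tree
    parser.set_element_class_lookup(html.HtmlElementClassLookup())

    # 2) get a child reference; its proxy is created with the new lookup
    #    (raises IndexError if the root has no children)
    child = root[0]

    # 3) drop the last Python reference to the old root proxy
    del root

    # 4) ask for the root again, which creates a fresh proxy
    return child.getparent()


new_root = parse_then_switch("<html><body><p>Hi</p></body></html>")
print(type(new_root).__name__, new_root.tag)
```

If any other reference to the old root object survives (a variable in the
caller, a list entry, a traceback), step 4 returns that same old object
rather than a new HtmlElement.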

Stefan

Stefan Behnel