Hi!

Vostretsov Nikita wrote on 26.05.21 at 11:35:
> Let me describe a situation. We have a lot of code working with
> lxml.html.HtmlElement. Now we want to support HTML5. html5lib is too slow
> for our requirements. Other libraries work in C for best performance, e.g.:
>
> - https://github.com/kovidgoyal/html5-parser/blob/master/src/as-libxml.c (gumbo)
> - http://source.netsurf-browser.org/libhubbub.git/tree/examples/libxml.c (libhubbub)
> - https://github.com/SimonSapin/html5ever-python/blob/master/html5ever/element... (html5ever)
>
> Converting Python structures
> (https://github.com/whalebot-helmsman/html5-parser/blob/lxml-html/src/html5_p...)
> by copying attributes, text and tail from lxml.etree._Element to
> lxml.html.HtmlElement is also slower than our current HTML4 code (~20%).
>
> As I understand it, there is no difference between HTML and XML at the C
> level. lxml.html.HtmlElement is a Python structure. Is it possible to have
> lxml.html.HtmlElement on top of lxml.etree._Element without copying (and
> the performance drop that comes with it)? Maybe I am missing another
> possibility?

lxml.html simply registers the HtmlElement class with the parser, which makes the parser return instances of it as the Python object representation of the elements. It's explained here:

https://lxml.de/element_classes.html

The parsers that you linked above may not support this directly (although there is no reason why they couldn't), but as soon as you have an lxml tree in your hands, you can replace the element class lookup that it uses. That way, you can keep the tree but change its Python interface.

Note that you need to take special care of the (root) element returned by the parser. As long as you hold a Python reference to it, it won't change its Python class interface. You may be able to work around that by

1) changing the element lookup scheme
2) getting hold of some child element reference (it should use the new class)
3) deleting the reference to the root element (make sure it's the last one)
4) using something like child.getroottree() or .getparent() to get a new reference to the root node

Stefan
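
The four steps above can be sketched roughly as follows. This is only an illustration under assumptions, not a tested recipe for the C-based parsers mentioned in the question: a plain etree.HTMLParser stands in for the external HTML5 parser, the lookup is swapped via the parser object's set_element_class_lookup() method, and lxml.html.HtmlElementClassLookup is used as the new scheme. It also relies on CPython's reference counting to release the old root proxy as soon as its last reference is deleted.

```python
from lxml import etree, html

# Stand-in for an external HTML5 parser (assumption for this sketch):
# a plain lxml.etree parser that returns ordinary etree elements.
parser = etree.HTMLParser()
root = etree.fromstring("<html><body><p class='x'>Hi</p></body></html>", parser)
old_class = type(root)  # remember the class only, not the element

# 1) Change the element lookup scheme used by the tree's parser.
parser.set_element_class_lookup(html.HtmlElementClassLookup())

# 2) Get hold of a child; its Python proxy is created with the new lookup.
child = root[0]

# 3) Delete the root reference. It has to be the last one, otherwise the
#    old proxy (with the old class) stays alive and keeps being returned.
del root

# 4) Ask the tree for the root again to get a freshly created proxy.
new_root = child.getparent()

print(old_class.__name__, "->", type(new_root).__name__)
print(isinstance(new_root, html.HtmlElement))
print(new_root.text_content())  # a method provided by lxml.html elements
```

Whether a tree produced by one of the external parsers picks up the lookup in the same way depends on how that parser hands its document over to lxml, so treat the parser object used here as a placeholder for whatever your integration provides.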