Building an lxml tree from C / Rust with html5ever
Hi all,

How would you recommend creating an lxml tree from C code? Or rather from Rust code, given that Rust code can call C functions and manipulate C structs, and Rust functions can be made to have a C-compatible ABI and be called from C. (I don’t mind adding some Cython code to the mix if it makes things easier.)

I’ve seen http://lxml.de/capi.html and etreepublic.pxd, but it’s not clear to me whether that API is supposed to be complete or whether I should use libxml2’s API as well.

Also, strings. Rust strings are UTF-8, and libxml2 seems to also be using UTF-8 internally. I’d rather not have everything go through Python strings and back, if possible.

Any advice or links to docs or example code would be appreciated.

## Background

When parsing HTML in Python, there is, among others, html5lib, which is standards-compliant but relatively slow, and lxml.html, which is very fast but uses libxml2’s old HTML 4 parser. Could we have the best of both?

html5ever is a parser written for Servo per WHATWG’s HTML Standard. It’s written in Rust with performance in mind (e.g. by avoiding string copies as much as possible). The parser is separated from tree builders and tree data structures.

https://github.com/servo/html5ever
https://servo.org/

I’ve played a bit with using html5ever from Python through CFFI and writing tree builders in Python. Performance is somewhere in between html5lib and lxml.html. Even after some optimization, cProfile shows that more than half the time is spent in the tree builder, creating Python objects and going back and forth between languages.

https://github.com/SimonSapin/html5ever-python
https://github.com/SimonSapin/html5ever-python/tree/master/benchmarks

I’m guessing that part of the reason lxml.html is so fast is that it doesn’t create a Python object for each node during parsing, only on demand during traversal. Perhaps a better benchmark would include a tree traversal in addition to parsing.

I think the same approach could be good for html5ever-python: create libxml2 nodes in Rust / C / Cython without involving much Python code or many Python objects, and then create an lxml.etree.ElementTree object at the end.

Thanks,
--
Simon Sapin
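To make the string question above concrete, here is a minimal sketch of handing UTF-8 text across a C boundary from Rust using only the standard library. The helper names (`to_c_utf8`, `from_c_utf8`, `demo_utf8_byte_len`) are hypothetical and are not part of html5ever, lxml or libxml2. The sketch assumes the C side wants NUL-terminated UTF-8 bytes, which is what a `const char *` style API generally expects.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Copy a Rust string into a NUL-terminated UTF-8 buffer that a C API
/// taking `const char *` could read. No Python string is involved.
/// Returns `None` if the input contains an interior NUL byte.
pub fn to_c_utf8(s: &str) -> Option<CString> {
    CString::new(s).ok()
}

/// Borrow a NUL-terminated buffer coming from C as a `&str`, without copying.
/// Returns `None` for a null pointer or for bytes that are not valid UTF-8.
///
/// # Safety
/// `ptr` must be null or point to a valid NUL-terminated buffer that
/// outlives the returned reference.
pub unsafe fn from_c_utf8<'a>(ptr: *const c_char) -> Option<&'a str> {
    if ptr.is_null() {
        return None;
    }
    unsafe { CStr::from_ptr(ptr) }.to_str().ok()
}

/// Hypothetical entry point with a C-compatible ABI: returns the length in
/// bytes of a NUL-terminated UTF-8 string, or -1 if it is null or invalid.
/// (Exporting it under this exact symbol name from a library would also
/// need a `no_mangle` attribute, omitted here.)
///
/// # Safety
/// Same requirements on `ptr` as `from_c_utf8`.
pub unsafe extern "C" fn demo_utf8_byte_len(ptr: *const c_char) -> isize {
    match unsafe { from_c_utf8(ptr) } {
        Some(s) => s.len() as isize,
        None => -1,
    }
}
```

The one copy in `to_c_utf8` is there only to append the terminating NUL. The reverse direction borrows the C buffer in place, so the caller has to keep that buffer alive for as long as the `&str` is used.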
Simon Sapin wrote on 25.02.2016 at 18:12:
Almost. Using libxml2's own API can be a bit of a headache, but once you have figured out which functions to call (and you can read lxml's code for that), you should get something up and working quite quickly.

In order to wrap everything for lxml, however, you'd additionally need a _Document instance. The quickest way to get at one might be to create an Element through lxml's API and read its _doc attribute from C/Cython.

Thus, I'd suggest starting from a new Python-level root Element and filling in the libxml2 subtree by adding nodes to its _c_node from C/Cython code. The advantage of going that way is that lxml configures its libxml2 document, hash tables, etc. itself, so you don't have to care about those details, and libxml2 can then use the document the same way lxml would.

Stefan
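One possible shape for the Rust half of this suggestion is sketched below. It is only an illustration under assumptions: `TreeSink`, `NodeHandle`, `append_nested` and the `Recorder` stand-in are hypothetical names, not lxml or html5ever API. The idea is that the C/Cython side supplies two callbacks (which in a real build might wrap libxml2's `xmlNewDocNode` and `xmlAddChild`) and passes the root Element's `_c_node` as the starting handle, so the Rust side never sees Python objects. Here the callbacks only record calls, so the sketch runs without libxml2.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_void};

/// Opaque node handle; on the C/Cython side this might be an `xmlNode *`.
pub type NodeHandle = *mut c_void;

/// Hypothetical callback table filled in by the C/Cython side.
#[repr(C)]
pub struct TreeSink {
    /// Create a detached element named by a NUL-terminated UTF-8 string.
    pub create_element: extern "C" fn(ctx: *mut c_void, name: *const c_char) -> NodeHandle,
    /// Append `child` as the last child of `parent`.
    pub append_child: extern "C" fn(ctx: *mut c_void, parent: NodeHandle, child: NodeHandle),
    /// Opaque context pointer handed back to every callback.
    pub ctx: *mut c_void,
}

/// Create one element per name, nesting each inside the previous one,
/// starting under `root`. Returns the deepest node, or `None` on failure.
pub fn append_nested(sink: &TreeSink, root: NodeHandle, names: &[&str]) -> Option<NodeHandle> {
    let mut parent = root;
    for name in names {
        let c_name = CString::new(*name).ok()?;
        let node = (sink.create_element)(sink.ctx, c_name.as_ptr());
        if node.is_null() {
            return None;
        }
        (sink.append_child)(sink.ctx, parent, node);
        parent = node;
    }
    Some(parent)
}

/// Stand-in context for this sketch: records calls instead of building
/// libxml2 nodes.
#[derive(Default)]
pub struct Recorder {
    pub names: Vec<String>,
    pub edges: Vec<(usize, usize)>,
}

extern "C" fn record_create(ctx: *mut c_void, name: *const c_char) -> NodeHandle {
    let rec = unsafe { &mut *(ctx as *mut Recorder) };
    let name = unsafe { CStr::from_ptr(name) }.to_string_lossy().into_owned();
    rec.names.push(name);
    // Handles are 1-based indices disguised as pointers; null means failure.
    rec.names.len() as NodeHandle
}

extern "C" fn record_append(ctx: *mut c_void, parent: NodeHandle, child: NodeHandle) {
    let rec = unsafe { &mut *(ctx as *mut Recorder) };
    rec.edges.push((parent as usize, child as usize));
}

/// Build html > body > p through the recording sink and return the log.
pub fn demo() -> Recorder {
    let mut rec = Recorder::default();
    let sink = TreeSink {
        create_element: record_create,
        append_child: record_append,
        ctx: &mut rec as *mut Recorder as *mut c_void,
    };
    let root = (sink.create_element)(sink.ctx, b"html\0".as_ptr() as *const c_char);
    append_nested(&sink, root, &["body", "p"]);
    rec
}
```

Keeping the node type opaque on the Rust side means the document setup stays with lxml, as described above, and the Rust code only decides which nodes to create and where to attach them.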
participants (2)
- Simon Sapin
- Stefan Behnel