[lxml-dev] wrapping libxml2 with ctypes

I'm interested in wrapping (parts of) libxml2 with ctypes. Not because I have a use case for it right now, but because I need a larger project to drive ctypes to the limit, and to get more experience with it.

As I understand it, there are already Python bindings using pyrex to parts of libxml - it would also be very interesting to compare the speed of the pyrex bindings to ctypes bindings.

So I'm interested in answers to these questions:

- Are the bindings stable?
- Is the python api documented?
- Are there unittests, and what do they cover?

Thanks,
Thomas

Thomas Heller wrote:
I'm interested in wrapping (parts of) libxml2 with ctypes.
Cool!
There are definitely interesting challenges in wrapping libxml2.
The goal of the lxml bindings is also to significantly lift the level at which libxml2 is used, and to make it more Pythonic.
So I'm interested in answers to these questions:
Are the bindings stable?
No; there are still complicated bugs in the proxying approach and in the attempt at automatic memory management, which is a major goal of lxml.
Is the python api documented?
There are currently two APIs on top of libxml2 -- DOM (most read-only parts of DOM Core level 3), and ElementTree (partially). Both are documented but not by me. :)
Are there unittests, and what do they cover?
They cover the DOM API and the ElementTree APIs as far as they've been implemented.

I'd like to figure out a way to work together, though I'm not sure how. One way to do it is to both implement the same API and compare the code. If ctypes is significantly easier, we may want to switch to that. If we for some reason want both, we could perhaps look into reusing parts of the codebase.

Regards,
Martijn

Martijn Faassen <faassen@infrae.com> writes:
Ok, I took the etree.pyx file and hacked on it. Most of the translation from pyrex to ctypes Python was pretty straightforward:

- It is not possible to compare against NULL:

      if c_node is NULL: return # return None

  must be replaced by:

      if not c_node: return # return None

- pyrex dereferences pointers with the . operator, because Python doesn't have a -> operator:

      c_node = self._c_node.children

  In ctypes, one has to write this (which would also work in C, but nobody uses it, for good reasons):

      c_node = self._c_node[0].children

- And then translating all the pyrex-specific syntax back into Python again.

Finally, I had to replace the first part of the etree.pyx file, which contains the declarations of functions and structures, by the ctypes counterpart. This is an area I currently work on - most of this stuff can now be generated automatically by gccxml (which parses C header files into XML), and a parser/code-generator combo I've written that creates ctypes structure definitions, enum definitions, and function decorators.

The next layer had to be written manually: simple Python functions, which get decorated by the decorators, to expose the libxml2 API at a slightly higher level. For example, the xmlDocDumpMemory function in this layer receives the doc as its only parameter, and returns a Python string containing the result (freeing up the memory is done in the xmlDocDumpMemory function body itself).

I have to admit that I had to write a couple of functions which are currently missing in ctypes, but should be in there - that's my main reason for doing all this at the moment.

Ok, the resulting Python module passes quite a few of the tests from test_etree.py (I get some failures in the test_attributes* methods, but that may be because I did something wrong).
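The two pointer idioms above (truth-testing instead of comparing against NULL, and `[0]` to dereference) can be tried without libxml2 at all. The following is only a sketch: `Node` is a made-up two-field structure, not the real xmlNode layout, and `first_child_name` is a hypothetical helper written for this illustration.

```python
import ctypes


class Node(ctypes.Structure):
    """Hypothetical stand-in for a C node struct; NOT the real xmlNode layout."""


# Fields are assigned after the class exists so the struct can point to itself.
Node._fields_ = [
    ("name", ctypes.c_char_p),
    ("children", ctypes.POINTER(Node)),
]


def first_child_name(c_node):
    """Return the first child's name, or None if c_node or its child is NULL."""
    if not c_node:  # a NULL ctypes pointer is false; there is no NULL constant
        return None
    child = c_node[0].children  # ptr[0] dereferences, like (*ptr).children in C
    if not child:
        return None
    return child.contents.name  # .contents is an equivalent way to dereference


leaf = Node(b"leaf", None)  # None stores a NULL pointer in the children field
root = Node(b"root", ctypes.pointer(leaf))
null_node = ctypes.POINTER(Node)()  # a pointer type called with no args is NULL
```

Calling `first_child_name(ctypes.pointer(root))` walks one level down; passing `null_node` or a pointer to `leaf` exercises the two NULL checks.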
But I noticed that the tests only work if I run just a few of them: running more than 3 or 4 gives access violations pretty soon (well, ctypes catches them and converts them into Python exceptions, although this is a Windows-only feature). Are these problems similar to the memory management issues you mention above? Do you observe similar behaviour when using the pyrex wrapper?

I'm not sure ctypes is easier than pyrex (I have never used pyrex). It is probably much more difficult to find bugs in ctypes itself, but it should be easier to find bugs in the wrapper code, imo. It will be slower than pyrex; on the other hand, it doesn't require compilation.

Final remark: most of this stuff is not even in ctypes CVS, let alone in the released version. If you want to take a look at the module, I can post it here; it is slightly smaller than the pyx file (14469 vs. 15775 bytes).

Thomas
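The "call, copy the result into a Python string, free the C buffer in the wrapper body" shape described for the xmlDocDumpMemory wrapper can be sketched without libxml2. This is an illustration only, not the module discussed in this thread: `strdup` and `free` from the C library stand in for any C function that hands back a buffer the caller must release, `duplicate` is a made-up name, and loading the C library with `CDLL(None)` assumes a POSIX system.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) exposes the C library's symbols.
libc = ctypes.CDLL(None)

# Declare restype as c_void_p, not c_char_p: a c_char_p result is converted to
# a Python bytes object automatically, losing the address needed for free().
libc.strdup.argtypes = [ctypes.c_char_p]
libc.strdup.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


def duplicate(text):
    """Call C, copy the result into a Python bytes object, free the C buffer."""
    address = libc.strdup(text)
    if not address:  # a NULL c_void_p result comes back as None
        raise MemoryError("strdup returned NULL")
    try:
        return ctypes.string_at(address)  # copies up to the terminating NUL
    finally:
        libc.free(address)  # the caller never sees the C allocation
```

Keeping the allocation and the matching free inside one function means the C pointer never escapes into Python code, which removes one possible source of double frees and use-after-free crashes in a wrapper like this.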

Hey Thomas,

Sorry I haven't found the time to reply yet; I'm rather busy preparing this week and will then be away from the office for 3 weeks, but that doesn't mean I'm not very interested in what you've been up to. I'll try to find the time to reply in more detail later, but in the meantime please don't stop whatever you're doing! :)

Regards,
Martijn
