New version for PyPy, using cffi
Hello, I almost finished a port of lxml that uses cffi bindings instead of Cython. The goal is to run it with PyPy, which support for the CPython API is known to be very slow and somewhat incomplete. cffi on the other hand is much more friendly to PyPy's JIT. The code is there: https://github.com/amauryfa/lxml in the "cffi" branch. Note that it works directly from the source tree, no need to compile. You have to use a recent PyPy, ensure that the "lxml" directory can be seen from your PYTHONPATH, and that you have libxml development packages. The conversion -- from .pyx files to pure Python -- turned out to be quite straightforward. And the result is interesting IMO: - Impact on the existing code is minimal (in __init__.py), of course there is a lot of duplication. - Only 5 tests fail in the whole suite (if you except objectify, which is hard to support correctly on a non-refcounting gc) - Performance is on par with the version installed on my laptop, except from some catastrophic benchmarks, probably because of a missed optimization. I did not try to optimize anything, correctness first! My plan is to issue a pull request soon. Some work remains, though, and I need your help here: - adapt setup.py to the new system - fix the remaining tests - Is there a way to reduce the code duplication between .pyx and .py files? So please, try it with PyPy, try it with CPython (will be slower), comment on the code and report issues... Long live lxml! -- Amaury Forgeot d'Arc
Hi Amaury, first of all, thank you for the effort that you put into this. I'm not entirely happy about having a separate fork of the code base, but I'm sure (some?) PyPy users will appreciate it. Amaury Forgeot d'Arc, 14.06.2013 23:20:
I almost finished a port of lxml that uses cffi bindings instead of Cython. The goal is to run it with PyPy, which support for the CPython API is known to be very slow and somewhat incomplete. cffi on the other hand is much more friendly to PyPy's JIT.
The code is there: https://github.com/amauryfa/lxml in the "cffi" branch. Note that it works directly from the source tree, no need to compile. You have to use a recent PyPy, ensure that the "lxml" directory can be seen from your PYTHONPATH, and that you have libxml development packages.
The conversion -- from .pyx files to pure Python -- turned out to be quite straightforward.
Most of it, it seems. From what I gathered, there are at least some major adaptations that seem to have required some thought. The transformation is certainly nothing that could be automated.
And the result is interesting IMO:
- Impact on the existing code is minimal (in __init__.py), of course there is a lot of duplication.
Not sure what you mean. Which __init__.py where?
- Only 5 tests fail in the whole suite (if you except objectify, which is hard to support correctly on a non-refcounting gc)
Could you explain what the problem is here? lxml.objectify is certainly quite tightly coupled with lxml.etree, but it mainly uses standard features of the API, mainly custom element classes.
- Performance is on par with the version installed on my laptop, except from some catastrophic benchmarks, probably because of a missed optimization. I did not try to optimize anything, correctness first!
Sure.
My plan is to issue a pull request soon.
Is your goal to reintegrate this with mainline lxml? This will certainly require more work. And I wouldn't be happy about the additional maintenance overhead just to keep the additional cffi sources up-to-date.
Some work remains, though, and I need your help here: - adapt setup.py to the new system
Sounds like you'd better write a completely separate one. Most of what lxml is doing at install time is configuring the build. A cffi version shouldn't need that.
- fix the remaining tests - Is there a way to reduce the code duplication between .pyx and .py files?
Not easily. The Cython code merges everything into one module using includes, but it uses Cython syntax. Maybe submodules could be reused. Also, much of the implementation could be reduced to plain Python syntax, but that makes it more verbose. Maybe still less verbose than duplicated source files, but still too verbose to maintain IMHO. Plus, the functional changes for cffi would still need some major special casing. So, my current opinion on this is: I certainly won't do the work to merge it, and it's currently unclear to me what a good way to merge this would be. If it can be done without overly degrading maintenance, I'll consider it, but otherwise, having it in a fork sounds simpler for now. Stefan
2013/6/15 Stefan Behnel
Hi Amaury,
first of all, thank you for the effort that you put into this. I'm not entirely happy about having a separate fork of the code base, but I'm sure (some?) PyPy users will appreciate it.
Amaury Forgeot d'Arc, 14.06.2013 23:20:
I almost finished a port of lxml that uses cffi bindings instead of Cython. The goal is to run it with PyPy, which support for the CPython API is known to be very slow and somewhat incomplete. cffi on the other hand is much more friendly to PyPy's JIT.
The code is there: https://github.com/amauryfa/lxml in the "cffi" branch. Note that it works directly from the source tree, no need to compile. You have to use a recent PyPy, ensure that the "lxml" directory can be seen from your PYTHONPATH, and that you have libxml development packages.
The conversion -- from .pyx files to pure Python -- turned out to be quite straightforward.
Most of it, it seems. From what I gathered, there are at least some major adaptations that seem to have required some thought. The transformation is certainly nothing that could be automated.
Indeed. I meant that I did not have to change the application design: same module names, same functions, very similar code. Except maybe for the proxy module, and the FOR_EACH_FROM macros.
And the result is interesting IMO:
- Impact on the existing code is minimal (in __init__.py), of course there is a lot of duplication.
Not sure what you mean. Which __init__.py where?
The diff between the two branches: https://github.com/amauryfa/lxml/compare/master...cffi shows that I added many files, but modification to existing files are small. This means that the cffi version can coexist with the cython one.
- Only 5 tests fail in the whole suite (if you except objectify, which is hard to support correctly on a non-refcounting gc)
Could you explain what the problem is here? lxml.objectify is certainly quite tightly coupled with lxml.etree, but it mainly uses standard features of the API, mainly custom element classes.
Well, this issue is already mentioned in the lxml docs:
http://lxml.de/objectify.html#defining-additional-data-classes
"""Be aware, though, that this does not immediately apply to elements to
which there already is a Python reference. Their Python class will only be
changed after all references are gone and the Python object is garbage
collected.
"""
With a deterministic garbage collector, this is often not an issue. But
with a delayed gc, many tests fail.
For example:
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
parser.set_element_class_lookup(lookup)
root = parser.makeelement("root")
root.s = "test"
print type(root.s) #
- Performance is on par with the version installed on my laptop, except from some catastrophic benchmarks, probably because of a missed optimization. I did not try to optimize anything, correctness first!
Sure.
My plan is to issue a pull request soon.
Is your goal to reintegrate this with mainline lxml? This will certainly require more work. And I wouldn't be happy about the additional maintenance overhead just to keep the additional cffi sources up-to-date.
Yes, my goal is to merge it into lxml, either in the main branch, or in a separate branch. A separate fork is not too bad, until all the details are solved; but when it works, it should be hosted on the main lxml site. After all, it's not a separate project! redere Caesar... And it's probably easier for distribution tools to have only one PyPI entry.
Some work remains, though, and I need your help here: - adapt setup.py to the new system
Sounds like you'd better write a completely separate one. Most of what lxml is doing at install time is configuring the build. A cffi version shouldn't need that.
It probably needs a small block to declare the module as pure-python. And it's better if the run-time distribution does not need the header files. For this, it's necessary to import the modules once, and then redistribute the generated .so files.
- fix the remaining tests - Is there a way to reduce the code duplication between .pyx and .py files?
Not easily. The Cython code merges everything into one module using includes, but it uses Cython syntax. Maybe submodules could be reused. Also, much of the implementation could be reduced to plain Python syntax, but that makes it more verbose. Maybe still less verbose than duplicated source files, but still too verbose to maintain IMHO. Plus, the functional changes for cffi would still need some major special casing.
So, my current opinion on this is: I certainly won't do the work to merge it, and it's currently unclear to me what a good way to merge this would be. If it can be done without overly degrading maintenance, I'll consider it, but otherwise, having it in a fork sounds simpler for now.
Yes, I don't disagree. I have some experience on the maintenance of libraries though, and the maintenance work is of two kinds: - There must be unit tests for each new feature/bug fix on the main implementation. Otherwise the alternate implementation will completely overlook the change. I've seen that lxml development does this seriously, so this part of the maintenance is already part of the process. - Then, tests failing in the cffi implementation must be fixed. This is often a trivial task: port the changes from lxml/*.pyx to lxml-cffl/*.py. I understand it can become tedious, and for those cases a better way to share code is preferrable. Sometimes it's more complex though, because of differences in the underlying platform. But there even a more clever implementation won't help (remember all the bugs in cpyext...) For the near future I suggest the following: - I maintain this cffi version as a separate fork, fix the issues there, polish the remaining details and so on. - At some point, I hope to convice you and have the cffi version hosted on the main lxml repo, as a parallel branch if you prefer. -- Amaury Forgeot d'Arc
Amaury Forgeot d'Arc, 16.06.2013 22:34:
2013/6/15 Stefan Behnel
Amaury Forgeot d'Arc, 14.06.2013 23:20:
I almost finished a port of lxml that uses cffi bindings instead of Cython. The goal is to run it with PyPy, which support for the CPython API is known to be very slow and somewhat incomplete. cffi on the other hand is much more friendly to PyPy's JIT.
The code is there: https://github.com/amauryfa/lxml in the "cffi" branch. Note that it works directly from the source tree, no need to compile. You have to use a recent PyPy, ensure that the "lxml" directory can be seen from your PYTHONPATH, and that you have libxml development packages.
The conversion -- from .pyx files to pure Python -- turned out to be quite straightforward.
Most of it, it seems. From what I gathered, there are at least some major adaptations that seem to have required some thought. The transformation is certainly nothing that could be automated.
Indeed. I meant that I did not have to change the application design: same module names, same functions, very similar code. Except maybe for the proxy module, and the FOR_EACH_FROM macros.
Yes, those are very delicate. The proxy module (mis)uses several CPython internals and the whole purpose of the macros is speed. IIRC, they gave a performance boost of about 30% over the function based implementation and most of the places where they are used are very performance critical, often processing an entire tree in one tight loop.
And the result is interesting IMO: - Impact on the existing code is minimal (in __init__.py), of course there is a lot of duplication.
Not sure what you mean. Which __init__.py where?
The diff between the two branches: https://github.com/amauryfa/lxml/compare/master...cffi shows that I added many files, but modification to existing files are small.
This means that the cffi version can coexist with the cython one.
Ok. Plus, at least some parts of the Cython version could start using pure Python syntax, which would make the reuse easier. http://docs.cython.org/src/tutorial/pure.html apihelpers.pxi might get in the way, though, because it contains a couple of helper functions that need to get included in order to inline them. So it won't easily work to normally import it in PyPy and include it in Cython code.
http://lxml.de/objectify.html#defining-additional-data-classes """Be aware, though, that this does not immediately apply to elements to which there already is a Python reference. Their Python class will only be changed after all references are gone and the Python object is garbage collected. """
I see. This should rarely be a problem in practice, but it might bite some users quite unexpectedly. But then again, any PyPy user will have to become aware of this difference at some point.
my goal is to merge it into lxml, either in the main branch, or in a separate branch. A separate fork is not too bad, until all the details are solved; but when it works, it should be hosted on the main lxml site. After all, it's not a separate project! redere Caesar... And it's probably easier for distribution tools to have only one PyPI entry.
I guess so.
Some work remains, though, and I need your help here: - adapt setup.py to the new system
Sounds like you'd better write a completely separate one. Most of what lxml is doing at install time is configuring the build. A cffi version shouldn't need that.
It probably needs a small block to declare the module as pure-python.
Sure. I meant that the setup.py script and its related modules are already quite involved and confusing. Adding to that for something simple but completely separate (a different runtime!) would only make it worse. Hmm, this makes it appear like the actual problem is a deeper one, not so much related to this addition per se.
And it's better if the run-time distribution does not need the header files. For this, it's necessary to import the modules once, and then redistribute the generated .so files.
I've had very bad experience with binary distributions. They tend to work only for some users on MacOS-X, they are required and tend to work for Windows, but they always drop on your feet on Linux. The wheel format might help here, but then, it would require more than one distribution already...
- There must be unit tests for each new feature/bug fix on the main implementation. Otherwise the alternate implementation will completely overlook the change. I've seen that lxml development does this seriously, so this part of the maintenance is already part of the process.
Incidentally, most of the (overall not so many) missing tests are for the plain ElementTree API and also some surprisingly old features. New features tend to be better backed by tests and it's worth being strict for pull requests.
- Then, tests failing in the cffi implementation must be fixed. This is often a trivial task: port the changes from lxml/*.pyx to lxml-cffl/*.py. I understand it can become tedious, and for those cases a better way to share code is preferrable. Sometimes it's more complex though, because of differences in the underlying platform. But there even a more clever implementation won't help (remember all the bugs in cpyext...)
For the near future I suggest the following: - I maintain this cffi version as a separate fork, fix the issues there, polish the remaining details and so on. - At some point, I hope to convice you and have the cffi version hosted on the main lxml repo, as a parallel branch if you prefer.
If we can reduce the duplication to some (hopefully major) extent, I'd almost prefer having everything in the master branch. But that's talking eggs while you're raising the chicken. Stefan
On Fri, Jun 14, 2013 at 10:20 PM, Amaury Forgeot d'Arc
Hello,
I almost finished a port of lxml that uses cffi bindings instead of Cython. The goal is to run it with PyPy, which support for the CPython API is known to be very slow and somewhat incomplete. cffi on the other hand is much more friendly to PyPy's JIT.
The code is there: https://github.com/amauryfa/lxml in the "cffi" branch. Note that it works directly from the source tree, no need to compile. You have to use a recent PyPy, ensure that the "lxml" directory can be seen from your PYTHONPATH, and that you have libxml development packages.
The conversion -- from .pyx files to pure Python -- turned out to be quite straightforward. And the result is interesting IMO:
- Impact on the existing code is minimal (in __init__.py), of course there is a lot of duplication.
- Only 5 tests fail in the whole suite (if you except objectify, which is hard to support correctly on a non-refcounting gc)
- Performance is on par with the version installed on my laptop, except from some catastrophic benchmarks, probably because of a missed optimization. I did not try to optimize anything, correctness first!
My plan is to issue a pull request soon. Some work remains, though, and I need your help here: - adapt setup.py to the new system - fix the remaining tests - Is there a way to reduce the code duplication between .pyx and .py files?
So please, try it with PyPy, try it with CPython (will be slower), comment on the code and report issues... Long live lxml!
Hi Amaury. Well done on the work. Here is the result of my first try:
from lxml import etree Traceback (most recent call last): File "<stdin>", line 1, in <module> File "lxml/../lxml-cffi/etree.py", line 2859, in <module> from .xmlschema import XMLSchema, XMLSchemaParseError File "lxml/../lxml-cffi/xmlschema.py", line 34, in <module> namespaces={u'xs': u'http://www.w3.org/2001/XMLSchema'}) File "lxml/../lxml-cffi/xpath.py", line 379, in __init__ self.set_context(xpathCtxt) File "lxml/../lxml-cffi/xpath.py", line 155, in set_context self._context.set_context(xpathCtxt) File "lxml/../lxml-cffi/xpath.py", line 90, in set_context self._set_xpath_context(xpathCtxt) File "lxml/../lxml-cffi/extensions.py", line 111, in _set_xpath_context handle = xpath.ffi.new_handle(self) AttributeError: 'FFI' object has no attribute 'new_handle'
This is with: Mageia 2 pypy-2.0.2 built from source lib64xml2-devel-2.7.8 lib64xslt-devel-1.1.26 Please say if further information would help. Keep well Friedel
2013/6/15 F Wolff
AttributeError: 'FFI' object has no attribute 'new_handle'
Oh, you need a newer version of cffi. I did not realized that "new_handle" was not present in 2.0.2 :-( Try with a (recent) nightly build of pypy, it has it for sure. -- Amaury Forgeot d'Arc
participants (3)
-
Amaury Forgeot d'Arc
-
F Wolff
-
Stefan Behnel