Inclusion of lxml-cffi into lxml

Hi, what would it take to include lxml-cffi (https://github.com/amauryfa/lxml/tree/cffi) as an official part of lxml? It works better on PyPy (the original lxml is slow there and prone to bugs, notably segfaulting for me). Cheers, fijal

Maciej Fijalkowski wrote on 11.03.2015 at 15:28:
The actual functional differences aren't all that big AFAICT, but the problem is that the syntactic changes in the cffi based modules are scattered all over the place. That prevents a direct integration of the two in one code base without an almost complete duplication of the code. OTOH, as long as someone has to maintain the cffi based code separately anyway, I don't see the advantage of having it in the same repo as lxml itself.
> It works better on PyPy (with the original lxml being slow and prone to bugs, notably segfaulting for me)
Yes, cpyext is still buggy and incomplete in some corners and far from optimised. I'm sure PyPy can be improved here, though. Stefan

On Wed, Mar 18, 2015 at 5:20 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
The main benefit is really the release cycle, plus "officializing" lxml-cffi so that people who install lxml can use it directly with PyPy without having to look for yet another module. If I put all the cffi changes in one spot, can this be done?
I know you believe it's just a matter of engineering, but we have to agree to disagree a bit. Yes, cpyext CAN be made to be a little better and perform a little better, but there is just no way it can ever get anywhere close to what we can get with cffi.

Maciej Fijalkowski wrote on 18.03.2015 at 16:23:
Merging the current fork as it stands would drop some 16K lines of almost entirely redundant code on lxml, with tiny differences every couple of lines, mostly syntactic, some functional. I don't see how to reduce this to a level that does not impact the maintainability in the future. Regardless, I sporadically put some effort into simplifying older parts of lxml's code base to get rid of "Cythonisms" in favour of straight Python code that Cython optimises into similarly performing C code internally these days. That's not going to help much for the current 16K of problems, but it should at least reduce the amount of useless differences over time.
Why should that be the goal? If it runs safely without crashing, it's already much better than the current situation. The really performance critical stuff in lxml runs in highly tuned C, no matter if it's called by CPython or PyPy. The large cpyext overhead simply means that code that runs in PyPy has to take more care than in CPython to avoid doing too much work in Python code. It's not like that's an unusual pattern for Python developers. I (and others) already invested some time into making Cython work around gaps, bugs and design differences in cpyext (e.g. reduce number of C-API calls or avoid borrowed references), and I keep doing that. lxml benefits from it, to the point that most of its functionality just works. With a little more effort, I think we can get rid of the remaining crashes in lxml as well. Stefan
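The point above about avoiding too much work in Python code can be sketched roughly as follows. This is not from the thread: the document, the tag names and both helper functions are made up for illustration, and the fallback to the stdlib `xml.etree.ElementTree` is only there so the snippet runs without lxml installed. The first helper touches every node from Python, which under cpyext would mean many small crossings into the extension module; the second asks the library for the matches in a single call.

```python
try:
    from lxml import etree  # C-backed implementation discussed in the thread
except ImportError:
    import xml.etree.ElementTree as etree  # stand-in with the same basic API

# Hypothetical sample document: five <item> elements and one unrelated element.
XML = "<root>" + "".join("<item>%d</item>" % i for i in range(5)) + "<other/></root>"


def count_items_chatty(root):
    """Walk the tree by hand: several library calls for every single node."""
    count = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == "item":
            count += 1
        stack.extend(list(node))  # fetch children, one wrapper object per child
    return count


def count_items_batched(root):
    """Ask the library to find the matches; Python only sees the result list."""
    return len(root.findall(".//item"))


root = etree.fromstring(XML)
chatty = count_items_chatty(root)
batched = count_items_batched(root)
print(chatty, batched)
```

Both helpers return the same count; the difference is only in how often Python code has to call into the library while getting there.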
