2013/6/15 Stefan Behnel <stefan_ml@behnel.de>
Hi Amaury,

first of all, thank you for the effort that you put into this. I'm not
entirely happy about having a separate fork of the code base, but I'm sure
(some?) PyPy users will appreciate it.

Amaury Forgeot d'Arc, 14.06.2013 23:20:
> I almost finished a port of lxml that uses cffi bindings instead of Cython.
> The goal is to run it with PyPy, which support for the CPython API is known
> to be very slow and somewhat incomplete.
> cffi on the other hand is much more friendly to PyPy's JIT.
>
> The code is there: https://github.com/amauryfa/lxml in the "cffi" branch.
> Note that it works directly from the source tree, no need to compile.
> You have to use a recent PyPy, ensure that the "lxml" directory can be seen
> from your PYTHONPATH, and that you have libxml development packages.
>
> The conversion -- from .pyx files to pure Python -- turned out to be quite
> straightforward.

Most of it, it seems. From what I gathered, there are at least some major
adaptations that seem to have required some thought. The transformation is
certainly nothing that could be automated.

Indeed. I meant that I did not have to change the application design: same module names, same functions, very similar code.
Except maybe for the proxy module, and the FOR_EACH_FROM macros.
 
> And the result is interesting IMO:
>
> - Impact on the existing code is minimal (in __init__.py), of course there
> is a lot of duplication.

Not sure what you mean. Which __init__.py where?

The diff between the two branches: https://github.com/amauryfa/lxml/compare/master...cffi
shows that I added many files, but modification to existing files are small.

This means that the cffi version can coexist with the cython one.
 
> - Only 5 tests fail in the whole suite (if you except objectify, which is
> hard to support correctly on a non-refcounting gc)

Could you explain what the problem is here? lxml.objectify is certainly
quite tightly coupled with lxml.etree, but it mainly uses standard features
of the API, mainly custom element classes.

Well, this issue is already mentioned in the lxml docs:
http://lxml.de/objectify.html#defining-additional-data-classes
"""Be aware, though, that this does not immediately apply to elements to which there already is a Python reference. Their Python class will only be changed after all references are gone and the Python object is garbage collected.
"""
With a deterministic garbage collector, this is often not an issue. But with a delayed gc, many tests fail.
For example:

    from lxml import etree
    from lxml import objectify
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = objectify.ObjectifyElementClassLookup()
    parser.set_element_class_lookup(lookup)
    root = parser.makeelement("root")
    root.s = "test"
    print type(root.s)  # <class 'lxml.objectify.ObjectifiedElement'>
    import gc;gc.collect()
    print type(root.s)  # <class 'lxml.objectify.StringElement'>
 
> - Performance is on par with the version installed on my laptop, except
> from some catastrophic benchmarks, probably because of a missed
> optimization. I did not try to optimize anything, correctness first!

Sure.


> My plan is to issue a pull request soon.

Is your goal to reintegrate this with mainline lxml? This will certainly
require more work. And I wouldn't be happy about the additional maintenance
overhead just to keep the additional cffi sources up-to-date.

Yes, my goal is to merge it into lxml, either in the main branch, or in a separate branch.
A separate fork is not too bad, until all the details are solved;
but when it works, it should be hosted on the main lxml site.
After all, it's not a separate project! redere Caesar...
And it's probably easier for distribution tools to have only one PyPI entry.
 
> Some work remains, though, and I need your help here:
> - adapt setup.py to the new system

Sounds like you'd better write a completely separate one. Most of what lxml
is doing at install time is configuring the build. A cffi version shouldn't
need that.

It probably needs a small block to declare the module as pure-python.
And it's better if the run-time distribution does not need the header files.
For this, it's necessary to import the modules once, and then redistribute the generated .so files.
 
> - fix the remaining tests
> - Is there a way to reduce the code duplication between .pyx and .py files?

Not easily. The Cython code merges everything into one module using
includes, but it uses Cython syntax. Maybe submodules could be reused.
Also, much of the implementation could be reduced to plain Python syntax,
but that makes it more verbose. Maybe still less verbose than duplicated
source files, but still too verbose to maintain IMHO. Plus, the functional
changes for cffi would still need some major special casing.

So, my current opinion on this is: I certainly won't do the work to merge
it, and it's currently unclear to me what a good way to merge this would
be. If it can be done without overly degrading maintenance, I'll consider
it, but otherwise, having it in a fork sounds simpler for now.

Yes, I don't disagree.
I have some experience on the maintenance of libraries though, and the maintenance work is of two kinds:

- There must be unit tests for each new feature/bug fix on the main implementation.
  Otherwise the alternate implementation will completely overlook the change.
  I've seen that lxml development does this seriously, so this part of the maintenance is already part of the process.

- Then, tests failing in the cffi implementation must be fixed.
  This is often a trivial task: port the changes from lxml/*.pyx to lxml-cffl/*.py. I understand it can become tedious, and for those cases a better way to share code is preferrable.
  Sometimes it's more complex though, because of differences in the underlying platform. But there even a more clever implementation won't help (remember all the bugs in cpyext...)


For the near future I suggest the following:
- I maintain this cffi version as a separate fork, fix the issues there, polish the remaining details and so on.
- At some point, I hope to convice you and have the cffi version hosted on the main lxml repo, as a parallel branch if you prefer.
 
--
Amaury Forgeot d'Arc