Hi,
here's a little status update regarding lxml on PyPy. I got the basics of
lxml.etree working so far, mostly by patching up Cython and tracking down
bugs in PyPy's cpyext (CPython C-API compatibility) layer. I'm still
getting crashes during error reporting and haven't looked at XPath or
XSLT yet. But given that those involve little Python interaction per se,
I don't expect major surprises on that front.
The results are very encouraging, given that PyPy lacks support for many of
the tweaks and hacks that are possible in CPython.
Here's a little parser benchmark:
$ python2.7 -m timeit -s 'import lxml.etree as et' 'et.parse("hamlet.xml")'
100 loops, best of 3: 4.61 msec per loop
$ pypy -m timeit -s 'import lxml.etree as et' 'et.parse("hamlet.xml")'
100 loops, best of 3: 5.74 msec per loop
Pretty acceptable. That makes lxml the fastest XML parser that currently
exists for PyPy.
And here's a worst case benchmark for element proxy instantiation and
iteration, likely the most heavily tuned parts of lxml when running in CPython:
$ python2.7 -m timeit -s 'import lxml.etree as et; \
t=et.parse("hamlet.xml")' 'list(t.iter())'
100 loops, best of 3: 2.71 msec per loop
$ pypy -m timeit -s 'import lxml.etree as et; \
t=et.parse("hamlet.xml")' 'list(t.iter())'
10 loops, best of 3: 28.2 msec per loop
That's about a factor of 10. Sounds huge, but it's actually not bad,
considering the amount of extra work that has to be done for PyPy here.
It certainly doesn't render lxml unusable; we are still talking
milliseconds, after all. And no tuning has gone into this part yet, so
it's not the final word.
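For context, here is a sketch of what that iteration benchmark exercises:
t.iter() walks the whole tree in document order, and materializing the list
forces one Python proxy object per underlying libxml2 node, which is exactly
the costly part on PyPy. The sketch uses the stdlib's xml.etree.ElementTree
as a stand-in (lxml.etree implements the same API) and a tiny inline document
instead of hamlet.xml:

```python
import io
import xml.etree.ElementTree as et  # lxml.etree offers the same interface

# A tiny stand-in document; the benchmark above uses hamlet.xml instead.
xml_doc = b"""<play>
  <act><scene><line>To be, or not to be</line></scene></act>
  <act><scene><line>The rest is silence</line></scene></act>
</play>"""

t = et.parse(io.BytesIO(xml_doc))

# t.iter() yields every element in document order; building the list
# instantiates one proxy object per node, which is what the benchmark times.
elements = list(t.iter())
print(len(elements))               # 7: play, 2x act, 2x scene, 2x line
print([e.tag for e in elements[:3]])
```

Swapping the import for "import lxml.etree as et" and the BytesIO document
for hamlet.xml gives the timed workload above.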
I'm pretty optimistic.
BTW, if you're interested in improvements on this front, you can help
get this done faster by using the "donate" button on lxml's project
home page. Any donation will help free up some of my time for this.
Stefan