Dirk Rothe, 22.04.2012 07:48:
Some days ago, I played with some algorithms on PyPy and CPython that got their primary data from larger XML files. Whereas the algorithms (combinatorics by recursion-heavy list processing) got a nice speedup of about a factor of 5 (from 1 sec down to 200 msecs on my test data), the initial XML parsing + find()/findall() processing jumped from 5 msecs to 200 msecs.
I guess you switched to plain ElementTree? PyPy doesn't do all that badly here, but it's several times slower than the highly tuned C implementations in CPython:

http://blog.behnel.de/index.php?p=210

For tree iteration in lxml (and the related .find*() methods), PyPy already seems to be pretty close to what I get in CPython. Sure, it depends on how many results you get back because passing them through the interface to PyPy isn't very fast, but at least the internal tree traversal speed isn't impacted.

Examples:

Complete traversal, one hit:

    $ python2.7 -m timeit -s 'import lxml.etree as et; \
          t=et.parse("hamlet.xml")' \
          'list(t.iter("PLAY"))'
    1000 loops, best of 3: 382 usec per loop

    $ pypy -m timeit -s 'import lxml.etree as et; t=et.parse("hamlet.xml")' \
          'list(t.iter("PLAY"))'
    1000 loops, best of 3: 284 usec per loop

Complete traversal, tons of hits:

    $ python2.7 -m timeit -s 'import lxml.etree as et; \
          t=et.parse("hamlet.xml")' \
          'list(t.iter("LINE"))'
    1000 loops, best of 3: 1.94 msec per loop

    $ pypy -m timeit -s 'import lxml.etree as et; t=et.parse("hamlet.xml")' \
          'list(t.iter("LINE"))'
    100 loops, best of 3: 7.48 msec per loop

Surprisingly enough, I get very unreliable results for PyPy here. Rerunning the above several times gives me this as the best result:

    $ pypy -m timeit -s 'import lxml.etree as et; t=et.parse("hamlet.xml")' \
          'list(t.iter("LINE"))'
    100 loops, best of 3: 3.71 msec per loop

So it seems that it *can* be pretty close to CPython for that as well.

But your use case reminds me of iterparse(). There will certainly be some substantial overhead involved in running iterparse in PyPy. Currently, it seems to be about a factor of 15:

    $ pypy -m timeit -s 'import lxml.etree as et' \
          't=list(et.iterparse("hamlet.xml"))'
    10 loops, best of 3: 157 msec per loop

    $ python2.7 -m timeit -s 'import lxml.etree as et' \
          't=list(et.iterparse("hamlet.xml"))'
    100 loops, best of 3: 10.8 msec per loop

Needs some work and a bit of profiling, I guess...
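If you want to repeat these measurements without hamlet.xml at hand, something like the following might do. It is only a rough sketch: the helper names (make_play, count_iter, count_iterparse, best_of) and the generated document are made up for illustration, and it falls back to the stdlib ElementTree when lxml is not installed, so the numbers will not be comparable to the ones above.

```python
import io
import timeit

try:
    import lxml.etree as et  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in, since it
    # offers the same parse()/iter()/iterparse() calls used below.
    import xml.etree.ElementTree as et


def make_play(n_lines=1000):
    """Build a small PLAY/LINE document in memory (hypothetical test data)."""
    lines = "".join("<LINE>line %d</LINE>" % i for i in range(n_lines))
    doc = "<PLAY><ACT><SCENE>%s</SCENE></ACT></PLAY>" % lines
    return doc.encode("utf-8")


def count_iter(tree, tag):
    """Traverse an already parsed tree and count the elements named tag."""
    return len(list(tree.iter(tag)))


def count_iterparse(data):
    """Parse incrementally and count the reported events ('end' by default)."""
    return len(list(et.iterparse(io.BytesIO(data))))


def best_of(func, repeat=3, number=10):
    """Best time per call in seconds, similar to timeit's 'best of 3'."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


if __name__ == "__main__":
    data = make_play()
    tree = et.parse(io.BytesIO(data))
    for label, func in [
        ("iter('PLAY')", lambda: count_iter(tree, "PLAY")),
        ("iter('LINE')", lambda: count_iter(tree, "LINE")),
        ("iterparse()", lambda: count_iterparse(data)),
    ]:
        print("%-14s %10.1f usec per call" % (label, best_of(func) * 1e6))
```

Running the same script under both python2.7 and pypy should give the one-hit, many-hits and iterparse cases side by side. Given how much the PyPy timings jumped around for me, it is probably worth running it several times.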
So, lxml on PyPy would be awesome!
You can support the progress.

Stefan