[pypy-dev] which xml libraries? was (Re: PyPy 1.4 released)

Paolo Giarrusso p.giarrusso at gmail.com
Mon Nov 29 21:54:24 CET 2010

On Mon, Nov 29, 2010 at 14:40, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Amaury Forgeot d'Arc, 28.11.2010 11:44:
>> 2010/11/28 Maciej Fijalkowski
>>> On Sun, Nov 28, 2010 at 11:58 AM, René Dudfield wrote:
>>>> what xml libraries are people using with pypy?  What is working well?
>>> PyExpat works, although it's slow (ctypes-based implementation). I
>>> know genshi has some troubles with it, someone is debugging now.
>>> Besides I don't think there are any working (unless someone wrote a
>>> pure-python one)
>> PyExpat is now a built-in module, implemented in RPython,
>> and should have reasonable performance.
> Hmm, reasonable?
> $ ./bin/pypy -m timeit -s 'import xml.etree.ElementTree as ET' \
>      'ET.parse("ot.xml")'
> 10 loops, best of 3: 1.27 sec per loop
> $ python2.7 -m timeit -s 'import xml.etree.ElementTree as ET' \
>      'ET.parse("ot.xml")'
> 10 loops, best of 3: 486 msec per loop
> $ python2.7 -m timeit -s 'import xml.etree.cElementTree as ET' \
>      'ET.parse("ot.xml")'
> 10 loops, best of 3: 33.7 msec per loop

Is any JITting expected to trigger with so few iterations? Or does
RPython save the need for that? I tried increasing the loop count,
but I couldn't, because of two different bugs somewhere (in PyPy I

I tried ensuring that at least 1000 iterations were displayed, but
timeit doesn't work for more than 852 iterations on the attached
example (found on my HD):

$ pypy-trunk/pypy/translator/goal/pypy-c -m timeit -n 853 \
     -s 'import xml.etree.ElementTree as ET' \
     'ET.parse("extensionNames.xml")'
ImportError: No module named linecache

Now, even though linecache is only imported locally, linecache.py does
exist (in the same directory as timeit.py, i.e. lib-python/2.5.2/).

Furthermore, it works fine from the interactive interpreter prompt,
suggesting that the -m option might be part of the bug:
import timeit
a = timeit.Timer('ET.parse("extensionNames.xml")',
                 'import xml.etree.ElementTree as ET')

However, a larger iteration count doesn't work:

>>>> a.timeit(10000)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/pgiarrusso/Documents/Research/Sorgenti/PyPy/pypy-trunk/lib-python/2.5.2/timeit.py", line 161, in timeit
  File "<timeit-src>", line 6, in inner
  File "/Users/pgiarrusso/Documents/Research/Sorgenti/PyPy/pypy-trunk/lib_pypy/xml/etree/ElementTree.py", line 862, in parse
  File "/Users/pgiarrusso/Documents/Research/Sorgenti/PyPy/pypy-trunk/lib_pypy/xml/etree/ElementTree.py", line 579, in parse
IOError: [Errno 24] Too many open files: 'extensionNames.xml'

Inspecting the pypy process confirms that file handles to the XML
file are leaked. Whether that is because the GC is not being invoked,
because a destructor is missing, or simply because the code should
release file handles explicitly, I don't know. Is there a way to
trigger an explicit GC to work around such issues?
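
For illustration, here is a minimal sketch of two possible workarounds
for the leaked handles. It rests on assumptions rather than anything
verified against PyPy 1.4: that the leak comes from file objects
waiting to be finalized, and that gc.collect() finalizes them. The
helper names parse_closing and parse_many, and the temporary demo
file, are hypothetical and only there to make the example
self-contained.

```python
import gc
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_closing(path):
    # Workaround 1: open the file ourselves and close it deterministically,
    # instead of relying on the garbage collector to finalize the handle.
    f = open(path, "rb")
    try:
        return ET.parse(f)
    finally:
        f.close()


def parse_many(path, iterations, collect_every=100):
    # Workaround 2: keep calling ET.parse(path), but force a collection
    # every `collect_every` iterations so unreachable file objects can be
    # finalized (assumption: finalization is what closes the handle).
    tree = None
    for i in range(iterations):
        tree = ET.parse(path)
        if (i + 1) % collect_every == 0:
            gc.collect()
    return tree


if __name__ == "__main__":
    # Hypothetical stand-in for extensionNames.xml.
    fd, demo_path = tempfile.mkstemp(suffix=".xml")
    os.write(fd, b"<extensions><name>demo</name></extensions>")
    os.close(fd)
    try:
        print(parse_closing(demo_path).getroot().tag)
        print(parse_many(demo_path, 1000).getroot().tag)
    finally:
        os.remove(demo_path)
```

The first variant does not depend on GC behaviour at all, so it is
probably the safer one for a benchmark loop. The second only makes
sense if changing the benchmarked statement is not an option.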

Warning: all this is with a 32bit PyPy-1.4 on Mac OS X.

Paolo Giarrusso - Ph.D. Student
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extensionNames.xml
Type: text/xml
Size: 365 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20101129/084bba6c/attachment.xml>
