[pypy-issue] [issue993] threading.Lock.release performance

Vincent Pelletier tracker at bugs.pypy.org
Mon Jan 9 23:30:35 CET 2012


New submission from Vincent Pelletier <plr.vincent at gmail.com>:

I ran one of my projects[1] with PyPy to check the performance increase
compared to CPython 2.7. PyPy is at least 4 times faster. So far, so good.
But when I ran cProfile under PyPy, the functions using the most time were
Lock.acquire and Lock.release.

A bit of background: the 1480A is a USB protocol analyser[2], so it
generates traffic dumps which my module parses. It aggregates transactions
using the pure-Python "ply.yacc" module. As I want to parse live captures
and ply.yacc doesn't have an API for pushing tokens individually, I use
threads (one for ply, and the main interpreter thread for lower-level
stream operations) and a custom Queue.Queue variant (because profiling
under CPython showed Queue.Queue's locking scheme to be very expensive, and
I don't need the features it makes possible). See
iti1480a/parser.py:SimpleQueue for my implementation; a sketch follows
below. As a test data set, I use the sample trace files provided with ITI's
software (free as in beer, Windows-only, but Wine can install it fine).
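
For illustration, here is a minimal sketch of the idea behind SimpleQueue
(simplified from the real code in iti1480a/parser.py, so details may
differ): the consumer sleeps on a plain thread lock that put() releases,
instead of Queue.Queue's condition variables.

    import thread
    from collections import deque

    class SimpleQueue(object):
        # Single-producer/single-consumer queue: get() sleeps on a bare
        # thread lock instead of Queue.Queue's condition variables.
        def __init__(self):
            self._lock = thread.allocate_lock()
            self._lock.acquire()  # start "empty": first get() will block
            self._queue = deque()

        def get(self):
            while True:
                try:
                    return self._queue.popleft()
                except IndexError:
                    self._lock.acquire()  # sleep until put() wakes us up

        def put(self, item):
            self._queue.append(item)
            try:
                self._lock.release()  # wake the consumer if it is waiting
            except thread.error:
                pass  # lock already released: consumer is not blocked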

Command used and its output (timings given as a cross-reference between the
profiler outputs below):
$ time PYTHONPATH=/usr/lib/python2.7/dist-packages/ pypy-1.7/bin/pypy \
    -m iti1480a.bench \
    < ~/.wine/drive_c/Program\ Files/International\ Test\ Instruments/1480A\ USB\ Protocol\ Analyzer/dvd_rw_drive.usb > /dev/null
real    0m4.147s
user    0m3.864s
sys     0m0.796s

Profiler stats, per thread:
Main thread (producer):
>>> p.sort_stats('cumulative').print_stats()
Mon Jan  9 19:59:59 2012    ../prof.out
         8238158 function calls (6884227 primitive calls) in 6.144 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1122909    0.183    0.000    4.279    0.000 iti1480a/parser.py:634(__call__)
   521522    0.070    0.000    3.777    0.000 iti1480a/parser.py:668(rxcmd)
   260744    0.039    0.000    3.707    0.000 iti1480a/parser.py:590(__call__)
   260745    0.269    0.000    3.681    0.000 iti1480a/parser.py:28(put)
   260746    3.034    0.000    3.034    0.000 {method 'release' of 'thread.lock' objects}
  1362123    0.772    0.000    1.478    0.000 iti1480a/parser.py:789(read16)
   601346    0.181    0.000    0.319    0.000 iti1480a/parser.py:663(data)
  1362290    0.293    0.000    0.293    0.000 {method 'read' of 'cStringIO.StringI' objects}
   260746    0.278    0.000    0.278    0.000 {method 'acquire' of 'thread.lock' objects}
[...]

Second thread (consumer):
>>> p.sort_stats('cumulative').print_stats()
Mon Jan  9 19:59:59 2012    ../thread_prof.out
         5310276 function calls (4804628 primitive calls) in 7.947 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
505629/260745    1.976    0.000    5.235    0.000 iti1480a/parser.py:20(get)
   489784    4.729    0.000    4.729    0.000 {method 'acquire' of 'thread.lock' objects}
  1713792    0.489    0.000    0.489    0.000 {method 'append' of 'list' objects}
   856896    0.222    0.000    0.222    0.000 {method 'get' of 'dict' objects}
[...]
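
The per-thread stats above were loaded with pstats. For reference, a rough
sketch of how such per-thread dumps can be produced (the wrapper and file
names are hypothetical, not the actual bench code):

    import cProfile
    import pstats

    def profiled(func, dump_path):
        # Wrap a thread's target function so it dumps its own profile
        # when it finishes.
        def wrapper(*args, **kw):
            prof = cProfile.Profile()
            try:
                return prof.runcall(func, *args, **kw)
            finally:
                prof.dump_stats(dump_path)
        return wrapper

    # Later, in an interactive session:
    p = pstats.Stats('../prof.out')  # or '../thread_prof.out'
    p.sort_stats('cumulative').print_stats()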

Here is my reading:
- ~2s of general imprecision from cProfile, as the producer thread is
supposed to outlive the consumer thread.
- The total durations reported by cProfile (6 to 8s) show a similar
imprecision compared to the "time" output.
- The producer thread's main contention is lock.release.
- The consumer thread's main contention is lock.acquire (though acquire is
not a bottleneck on the producer side).
Generally, this means the consumer spends most of its time waiting for the
producer (which is what I expected). But it also means the producer spends
half of its run time releasing the queue lock, which I find surprising.
From there, maybe acquire could be improved too, but it's harder to tell as
it is legitimate for it to block.

Asking on #pypy, I got the following answer:
< fijal> vpelletier: the problem is it is implemented in C and as such can't be
inlined by the JIT
< fijal> nor optimizations can be performed
< fijal> you could move parts of it to RPython
< fijal> and maybe even add some specialcasing to the JIT
< fijal> that would make things faster

Being new to PyPy (and to JITs in general), and after reading the code, I
don't think I can achieve anything by myself. So I decided to report this
as a bug (as fijal suggested too).

For completeness, I benchmarked Queue.Queue against my implementation (both
runs under PyPy, without cProfile). To my great surprise, they are
comparably fast:
Queue.Queue:
real    0m4.341s
user    0m3.976s
sys     0m1.060s
Custom implementation:
real    0m4.080s
user    0m3.920s
sys     0m0.644s
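
For illustration only, a standalone microbenchmark in the same spirit could
look like the following (the item count is made up; the timings above come
from the real iti1480a.bench run):

    import time
    import threading
    import Queue

    def bench(queue_factory, n=1000000):
        # Push n items from this thread to a consumer thread and time it.
        q = queue_factory()
        consumer = threading.Thread(
            target=lambda: [q.get() for _ in xrange(n)])
        start = time.time()
        consumer.start()
        for i in xrange(n):
            q.put(i)
        consumer.join()
        return time.time() - start

    print 'Queue.Queue: %.2fs' % bench(Queue.Queue)
    print 'SimpleQueue: %.2fs' % bench(SimpleQueue)  # class sketched above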

[1] https://github.com/vpelletier/ITI1480A-linux
[2] http://www.internationaltestinstruments.com/

----------
messages: 3684
nosy: pypy-issue, vpelletier
priority: performance bug
release: 1.7
status: unread
title: threading.Lock.release performance

________________________________________
PyPy bug tracker <tracker at bugs.pypy.org>
<https://bugs.pypy.org/issue993>
________________________________________

