I'm currently hunting a problem that I have been debugging for quite a long time. I think it is the root cause of why the PyPy translation with PyPy is still slower than with CPython. Here are some of my findings (+ questions):
The last tests that fail all have one thing in common: they have an issue with the GIL/threading. (See ) The most interesting ones are the last five.
* test_gc_locking (2x) fails on the build bot (only when using CPython), but not on my machine. This is strange because the buildbot and my VM use the same distro, same compiler version, etc. The only differences are that the buildbot has better hardware and the tests are run through testrunner. Is there another way I can reproduce this?
* test_ping_pong (an -A test) ping-pongs from one thread to another, stressing locking and the GIL switch. On s390x with the translated VM this takes really long (10 seconds, and on the buildbot it seems to exceed 30 seconds when run in parallel). However, if I run the same test with PYPYLOG=jit:- it completes in ~0.96 seconds (under gdb it is the same). If you subtract the time needed for printing, you might end up with the same speed x86 has for this test. What does the printing/gdb trigger to let the GIL switch happen that smoothly?
* I have placed memory fences at the same positions as on PPC (2x isync and lwsync). Are there any other places that need to complete all pending memory operations?
* There is one path in call_release_gil (just after the call) where rpy_fastgil was acquired (because it was 0) and the shadowstack is not the one of the current thread. Then *rpy_fastgil = 0 is set for the slowpath function. Wouldn't it be possible to steal the GIL at this point? Would that lead to a problem?
On Tue, Feb 2, 2016 at 6:11 PM, Richard Plangger email@example.com wrote:
- There is one path in call_release_gil (just after the call) where rpy_fastgil was acquired (because it was 0) and the shadowstack is not the one of the current thread. Then *rpy_fastgil = 0 is set for the slowpath function. Wouldn't it be possible to steal the GIL at this point? Would that lead to a problem?
No, it's not a problem: setting *rpy_fastgil to zero releases the GIL again, and it will be re-acquired by the called function (at reacqgil_addr). There is no issue if the GIL is re-acquired by a different thread exactly here; we will simply block in reacqgil_addr.
Note that we should in theory do a "lwsync" just before setting *rpy_fastgil to zero, like we do in call_releasegil_addr_and_move_real_arguments(). It's not done, but I *think* it doesn't hurt in this particular case. I may actually be very wrong, but I base my reasoning on these facts:
1) "*rpy_fastgil=0" can always appear to occur later, from the point of view of other processors, which is not a problem
2) in this case the "*rpy_fastgil=0" cannot appear to occur too early: it must appear after the "stdcx." instruction changed it to 1, and there is no other store between that "stdcx." and the following "*rpy_fastgil=0". So "lwsync" is not useful here.