The new faulthandler module is now fully functional and has no known issues left. Its timeout feature is used in regrtest to dump the Python traceback and exit if a test takes more than 1 hour.
Using the regrtest timeout and the faulthandler signal handlers (enabled in regrtest), I started to collect tracebacks of all timeouts:
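As a rough illustration of the timeout feature, here is a minimal sketch of a watchdog around a single test function. It is not the regrtest code: `run_with_watchdog` is a hypothetical helper, and the function names are those of the faulthandler module as shipped in the standard library, which may differ from the version discussed here. regrtest exits on timeout; the sketch passes `exit=False` so that it can be tried without killing the process.

```python
import faulthandler


def run_with_watchdog(func, timeout, log):
    """Run func() under a faulthandler watchdog (hypothetical helper).

    `log` must be an open file with a real file descriptor, and it must
    stay open for as long as faulthandler may write to it.
    """
    # Install handlers for fatal signals (SIGSEGV, SIGBUS, ...) so that a
    # crash dumps the traceback of all threads into `log`.
    faulthandler.enable(file=log, all_threads=True)
    # Arm the watchdog: if func() is still running after `timeout` seconds,
    # the tracebacks of all threads are written to `log`.
    # regrtest exits on timeout; exit=False here only reports.
    faulthandler.dump_traceback_later(timeout, exit=False, file=log)
    try:
        return func()
    finally:
        # Disarm the watchdog and remove the signal handlers again.
        faulthandler.cancel_dump_traceback_later()
        faulthandler.disable()
```

Calling `run_with_watchdog(test, 3600, sys.stderr)` would roughly mimic the 1 hour limit described above; with `exit=True` the process would also terminate once the traceback has been written.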
* test_threading.test_notify() on Windows http://bugs.python.org/issue11769 Not analyzed yet. I am unable to reproduce it in my VM.
* test_mmap.test_large_offset() on Mac OS X http://bugs.python.org/issue11779 May be related to (and fixed by) issue #11277, which has a patch.
* test_threading.test_3_join_in_forked_from_thread() on Ubuntu http://bugs.python.org/issue11870 Only seen once.
* test_mmap.test_big_buffer() on Mac OS X (it's a crash, bus error) http://bugs.python.org/issue11277 The origin of the problem was already identified, but the trace proves that faulthandler is able to correctly catch SIGBUS ;-)
* test_ttk_guionly on Mac OS X (bus error) http://bugs.python.org/issue5120 Same as #11277 (the origin of the problem was already identified)
* test_io.test_interrupted_write_text() on FreeBSD http://bugs.python.org/issue11859 (there was already enough information without faulthandler)
* test_threadsignals.test_signals() on Mac OS X http://bugs.python.org/issue11768 Race condition (deadlock).
* test_multiprocessing.test_async_error_callback() on many OSes http://bugs.python.org/issue8428 Race condition.
I'm proud of #11768 (because I fixed it). The bug was a deadlock. Such an issue is usually very hard to reproduce, and without faulthandler the only available information was the name of the file. With faulthandler, we have not only the name of the test function, but also the full traceback of the hang and the traceback of all other threads.
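To give an idea of what "the traceback of all other threads" looks like, here is a small sketch that dumps every thread while a worker is stuck waiting for a lock. It is only an illustration, not the code from #11768: `blocked_worker` and `dump_while_blocked` are hypothetical names, and the worker is blocked on purpose and released right after the dump.

```python
import faulthandler
import threading


def blocked_worker(lock, started):
    """Worker that signals it is running, then waits for a held lock."""
    started.set()
    with lock:  # blocks until the main thread releases the lock
        pass


def dump_while_blocked(log):
    """Dump the traceback of all threads while a worker is stuck."""
    lock = threading.Lock()
    started = threading.Event()
    lock.acquire()  # hold the lock so that the worker cannot finish
    worker = threading.Thread(target=blocked_worker, args=(lock, started))
    worker.start()
    started.wait()  # make sure the worker is inside blocked_worker()
    try:
        # One section per thread is written to `log`, each listing the
        # frames of that thread, most recent call first.
        faulthandler.dump_traceback(file=log, all_threads=True)
    finally:
        lock.release()  # let the worker finish
        worker.join()
```

The output has one section for the current thread and one for the worker, so both sides of a hang are visible at once, which is exactly what is missing when only the name of the hanging test is known.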
Thanks to the faulthandler trace of #8428, with the traceback of all threads, Charles-Francois Natali was able to understand and fix another complex race condition in multiprocessing (at shutdown).
I also fixed other issues (not using faulthandler) and so ALL OUR 3.X BUILDBOTS ARE GREEN!
... ok ok, except:
- sparc Debian 3.x: offline for 21 days
- PPC Leopard 3.x: "hg clean" fails with twisted.internet.error.ProcessExitedAlready, but I think that apart from this buildbot-specific issue, it would be green
- x86 Windows7 3.x: the master lost the connection with the slave on test_cmd_line, but it should be a sporadic problem
Anyway, if you see a "Timeout (1:00:00)!" or "Fatal error" (with a traceback) issue on a buildbot, please open a new issue (if one doesn't exist already; search at least for the name of the test file). If you have other problems related to the regrtest timeout or faulthandler, contact me or open an issue.
Finally, I'm very happy to see that my faulthandler module is as useful as I expected: with more information, we are now able to identify race conditions. I hope that we will fix all remaining threading, signal and subprocess race conditions!