[Python-Dev] Buildbots and faulthandler

Wed Apr 20 11:37:05 CEST 2011

Hi,

The new faulthandler module is now fully functional and has no known
issues left. Its timeout feature is used in regrtest to dump the Python
traceback and exit if a test takes more than 1 hour.
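
A minimal sketch of the timeout idea, not the actual regrtest code. It
assumes the function names of the faulthandler module as shipped in the
standard library (dump_traceback_later, cancel_dump_traceback_later);
the development version discussed here may have used slightly different
names. run_with_watchdog is a hypothetical helper, and the timeout is
shortened from 1 hour to a fraction of a second so the example finishes
quickly. It also passes exit=False, whereas regrtest exits after the
dump.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(func, timeout, log_file):
    """Run func(); if it is still running after `timeout` seconds,
    dump the tracebacks of all threads to log_file (hypothetical helper)."""
    # exit=False keeps the process alive after the dump; a test runner
    # that wants to abort a hung test would pass exit=True instead.
    faulthandler.dump_traceback_later(timeout, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Always cancel the pending dump once the call has returned.
        faulthandler.cancel_dump_traceback_later()


# faulthandler writes to the file descriptor, so a real file is needed.
with tempfile.TemporaryFile(mode="w+") as log:
    # The "test" sleeps longer than the timeout, so the watchdog fires.
    run_with_watchdog(lambda: time.sleep(0.5), timeout=0.1, log_file=log)
    log.seek(0)
    report = log.read()

print(report)
```

The report starts with a "Timeout (...)!" line followed by one traceback
per thread, which is the kind of output to look for in a buildbot log.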

Using the regrtest timeout and the faulthandler signal handlers (enabled
in regrtest), I started to collect the tracebacks of all timeouts.
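
For reference, installing the signal handlers looks roughly like this
with the standard-library faulthandler API. This is a sketch, not the
regrtest code: the log path is a made-up example, and regrtest may write
to stderr instead.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location, used only for this example.
log_path = os.path.join(tempfile.gettempdir(), "faulthandler_demo.log")

# The file must stay open for as long as the handlers are installed,
# because faulthandler writes to its file descriptor on a fatal signal.
log_file = open(log_path, "w")

# Install handlers for fatal signals such as SIGSEGV, SIGFPE, SIGABRT,
# SIGBUS and SIGILL, and ask for the tracebacks of all threads.
faulthandler.enable(file=log_file, all_threads=True)
handlers_installed = faulthandler.is_enabled()
print("fault handlers installed:", handlers_installed)

# Cleanup for the demo only; a test runner would leave them enabled.
faulthandler.disable()
log_file.close()
```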

Open issues:

 * test_threading.test_notify() on Windows
   http://bugs.python.org/issue11769
   Not analyzed yet. I am unable to reproduce it in my VM.

 * test_mmap.test_large_offset() on Mac OS X
   http://bugs.python.org/issue11779
   May be related to (and fixed by) issue #11277, which has a patch.

 * test_threading.test_3_join_in_forked_from_thread() on Ubuntu
   http://bugs.python.org/issue11870
   Only seen once.

 * test_mmap.test_big_buffer() on Mac OS X (it's a crash, bus error)
   http://bugs.python.org/issue11277
   The origin of the problem was already identified, but the trace
   proves that faulthandler is able to catch SIGBUS correctly ;-)

 * test_ttk_guionly on Mac OS X (bus error)
   http://bugs.python.org/issue5120
   Same as #11277 (the origin of the problem was already identified)

Closed issues:

 * test_io.test_interrupted_write_text() on FreeBSD
   http://bugs.python.org/issue11859
   (there was already enough information without faulthandler)

 * test_threadsignals.test_signals() on Mac OS X
   http://bugs.python.org/issue11768
   Race condition (deadlock).

 * test_multiprocessing.test_async_error_callback() on many OSes
   http://bugs.python.org/issue8428
   Race condition.

I'm proud of #11768 (because I fixed it). The bug was a deadlock. Such
an issue is usually very hard to reproduce, and without faulthandler
the only available information was the name of the file. With
faulthandler, we get not only the name of the test function, but also
the full traceback of the hang and the tracebacks of all other threads.
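
To illustrate what "the traceback of all other threads" means, here is a
small sketch that dumps every thread on demand with
faulthandler.dump_traceback(all_threads=True) from the standard-library
API. The worker function and the thread name are invented for the
example; a real deadlock would show the blocked lock acquisition in each
thread instead.

```python
import faulthandler
import tempfile
import threading


def worker(ready, stop):
    """Hypothetical worker that blocks until told to stop."""
    ready.set()
    stop.wait()


ready = threading.Event()
stop = threading.Event()
thread = threading.Thread(target=worker, args=(ready, stop),
                          name="demo-worker")
thread.start()
ready.wait()  # make sure the worker is really running worker()

try:
    with tempfile.TemporaryFile(mode="w+") as out:
        # One traceback per thread, the current thread included.
        faulthandler.dump_traceback(file=out, all_threads=True)
        out.seek(0)
        dump = out.read()
finally:
    stop.set()
    thread.join()

thread_count = dump.count("most recent call first")
print(dump)
```

Each thread gets its own "most recent call first" section, so a blocked
thread shows exactly which call it is waiting in.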

Thanks to the faulthandler trace of #8428, with the traceback of all
threads, Charles-Francois Natali was able to understand and fix another
complex race condition in multiprocessing (at shutdown).

I also fixed other issues (not using faulthandler) and so ALL OUR 3.X
BUILDBOTS ARE GREEN!

... ok ok, except:

 - sparc Debian 3.x: offline for 21 days
 - PPC Leopard 3.x: "hg clean" fails with
   twisted.internet.error.ProcessExitedAlready, but I think that apart
   from this buildbot-specific issue, it would be green
 - x86 Windows7 3.x: the master lost the connection with the slave on
   test_cmd_line, but it should be a sporadic problem

Anyway, if you see a "Timeout (1:00:00)!" or "Fatal error" (with a
traceback) on a buildbot, please open a new issue (if one doesn't
already exist: search at least for the name of the test file). If you
have other problems related to the regrtest timeout or faulthandler,
contact me or open an issue.

Finally, I'm very happy to see that my faulthandler module is as useful
as I expected: with more information, we are now able to identify race
conditions. I hope that we will fix all remaining threading, signal and
subprocess race conditions!

Victor