Green buildbot failure.
This run, recorded here, shows as green (it appears to have timed out): http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/7017 but the corresponding log for this Windows bot http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/7017/ste... has the expected os.chown failure. Are such green failures intended? -- Terry Jan Reedy
On Sat, 10 Aug 2013 21:40:46 -0400, Terry Reedy wrote:
This run, recorded here, shows as green (it appears to have timed out): http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/7017 but the corresponding log for this Windows bot http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/7017/ste... has the expected os.chown failure.
You've got the answer at the bottom:
"program finished with exit code 0"
So for some reason, the test suite crashed, but with a successful exit code. Buildbot thinks it ran fine.
Are such green failures intended?
Not really, no. Regards Antoine.
On 11/08/2013 11:00am, Antoine Pitrou wrote:
You've got the answer at the bottom:
"program finished with exit code 0"
So for some reason, the test suite crashed, but with a successful exit code. Buildbot thinks it ran fine.
Was the test terminated because it took too long? TerminateProcess(handle, exitcode) sometimes makes the program exit with return code 0 instead of exitcode. At any rate, test_multiprocessing contains this disabled test:

    # XXX sometimes get p.exitcode == 0 on Windows
    ...
    #self.assertEqual(p.exitcode, -signal.SIGTERM)

-- Richard
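A minimal, Windows-only sketch of the mechanism Richard describes; the sleeping child process and the use of CPython's private Popen._handle attribute are conveniences for the example, not anything from the buildbot setup:

    import ctypes
    import subprocess
    import sys

    # Start a child that would otherwise run for a long time.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(3600)"])

    # Kill it with an explicit, nonzero exit code. Popen._handle is a
    # CPython implementation detail, used here only to keep the sketch short.
    ctypes.windll.kernel32.TerminateProcess(int(child._handle), 42)

    # Usually prints 42; the disabled multiprocessing assertion above exists
    # because 0 was occasionally observed instead.
    print(child.wait())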
Richard Oudkerk wrote:
On 11/08/2013 11:00am, Antoine Pitrou wrote:
You've got the answer at the bottom:
"program finished with exit code 0"
So for some reason, the test suite crashed, but with a successful exit code. Buildbot thinks it ran fine.
Was the test terminated because it took too long?
Yes, it looks like it. This test (and one on the XP-4 buildbot in the same time frame) was terminated by an external watchdog script that kills python_d processes that have been running for more than 2 hours. I put the script in place (quite a while back) as a workaround for failures that would strand a python process, blocking future tests due to files remaining in use. It's a last-ditch, crude sledge-hammer.

Historically, if this code ran, the buildbot itself had already timed out, so the exit code (which I can't control) wasn't very important. Two hours had been conservative (and a trade-off, as longer values also risk failing more future tests), but it may need to be increased. In this particular case it was a false alarm: the host was heavily loaded during this time frame, which I think prolonged the test time by an unusually large amount.

-- David
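For illustration, a minimal sketch of the kind of watchdog David describes, substituting the psutil library for the Sysinternals pslist/pskill tools his actual script uses; the exact process name and the polling interval are assumptions:

    import time
    import psutil

    MAX_AGE = 2 * 60 * 60  # two hours, in seconds

    while True:
        now = time.time()
        for proc in psutil.process_iter(["name", "create_time"]):
            try:
                # Kill any debug-build Python process older than the threshold.
                if (proc.info["name"] == "python_d.exe"
                        and now - proc.info["create_time"] > MAX_AGE):
                    proc.kill()  # the last-ditch sledge-hammer
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        time.sleep(60)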
On 2013/8/11, David Bolen wrote:
Was the test terminated because it took too long?
Yes, it looks like it.
This test (and one on the XP-4 buildbot in the same time frame) was terminated by an external watchdog script that kills python_d processes that have been running for more than 2 hours. I put the script in place (quite a while back) as a workaround for failures that would strand a python process, blocking future tests due to files remaining in use. It's a last-ditch, crude sledge-hammer.
test.regrtest uses faulthandler.dump_traceback_later() to stop the test after a timeout if the --timeout command line option is used. http://docs.python.org/dev/library/faulthandler.html#faulthandler.dump_trace... Do you pass this option? The timeout is not global but applies to a single function of a test file, so you can use a shorter timeout. It also has the advantage of dumping the traceback of all Python threads before exiting. I didn't try this feature recently on Windows, but it is supposed to work :-) Victor
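A small sketch of the mechanism Victor points at: dump_traceback_later() arms a watchdog thread that dumps every Python thread's traceback (and, with exit=True, terminates the process) unless it is cancelled before the timeout; the five-second timeout here is just for demonstration:

    import faulthandler
    import time

    # Dump all thread tracebacks and exit if still running after 5 seconds.
    faulthandler.dump_traceback_later(5, exit=True)

    time.sleep(10)  # simulate a hung test; the watchdog fires first

    # Never reached here; a test that finished in time would call this
    # to disarm the watchdog.
    faulthandler.cancel_dump_traceback_later()

On a buildbot this would amount to passing something like --timeout=3600 through to regrtest.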
Victor Stinner wrote:
test.regrtest uses faulthandler.dump_traceback_later() to stop the test after a timeout if --timeout command line option is used.
The slave doesn't actually control the test parameters, which come from build/Tools/buildbot/test.bat (which runs build/PCBuild/rt.bat) plus anything sent from the master. But no, it doesn't look like that flow currently uses --timeout, so the main timeout in place is the buildbot slave's own (currently 3900s, based on output activity from the process under test).

Windows buildbots also have an additional "kill" path where the build scripts build and execute a separate kill_python_d executable (in PCBuild) to kill off any python_d process. It has some sequencing issues (it runs during the build stage rather than clean), but no matter where it is used, being part of the build sequence risks it being skipped if the master/slave connection breaks mid-test. For some additional background, see these email threads:

http://mail.python.org/pipermail/python-dev/2010-November/105585.html
http://mail.python.org/pipermail/python-dev/2010-December/106510.html
http://mail.python.org/pipermail/python-dev/2011-January/107776.html

Anyway, the termination in this particular case is completely separate from buildbot processing. It's a small script combining pslist/pskill from Sysinternals (as pskill proved always able to kill the processes) that just looks for old python_d processes and runs constantly in the background.

My Windows buildbots have three additional layers of termination handling (beyond the standard buildbot timeout and kill_python in the test itself):

1. A modification to the buildbot slave code to prevent Windows process and file dialogs.
2. An AutoIt script in the background to acknowledge C RTL dialogs that the prior step doesn't block. (There have been past discussions about having Python itself disable RTL dialogs in test builds; see the sketch after this message.)
3. The external watchdog script as a fail-safe.

The first two cases will definitely be recognized as test failures, since while the dialogs are suppressed/acknowledged, the triggering code will receive a failure result. The purpose of the watchdog script was to handle cases for which the normal termination processing (buildbot or Python itself) simply didn't seem to work: the buildbot slave/master thought the test ended or aborted, so it started new tests, but a process remained stuck in memory from the prior test. The frequency of occurrence varied over time, but during some periods it was a major pain in the neck, adversely affecting buildbot stability.

I'm not sure if faulthandler's approach to process termination would have more luck, or if it would even run if, for example, the process was stuck in the RTL or at the Win32 layer. I'd certainly be willing to retire the watchdog scripts (as long as I don't just end up firefighting stuck processes again), but I suspect the first challenge would be figuring out how to simulate an appropriately stuck process that would previously have required the watchdog script, given that it was never really obvious why they were hung.

-- David
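On the past discussions David mentions, a hedged sketch of what disabling the dialogs from within Python itself could look like, using the Win32 SetErrorMode() flags and the CRT report hooks that msvcrt exposes only in debug builds; this illustrates the idea, not David's actual setup:

    import ctypes

    # Win32 error-mode flags: suppress the critical-error, GPF and
    # open-file-error dialog boxes so a crashing process just exits.
    SEM_FAILCRITICALERRORS = 0x0001
    SEM_NOGPFAULTERRORBOX = 0x0002
    SEM_NOOPENFILEERRORBOX = 0x8000
    ctypes.windll.kernel32.SetErrorMode(
        SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX)

    import msvcrt
    if hasattr(msvcrt, "CrtSetReportMode"):
        # Only present in debug builds (python_d): route CRT warnings,
        # errors and asserts to stderr instead of a dialog box.
        for report_type in (msvcrt.CRT_WARN, msvcrt.CRT_ERROR, msvcrt.CRT_ASSERT):
            msvcrt.CrtSetReportMode(report_type, msvcrt.CRTDBG_MODE_FILE)
            msvcrt.CrtSetReportFile(report_type, msvcrt.CRTDBG_FILE_STDERR)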