Hi,

Just to let you know that we now have 8 stable buildbots, including Barry's own PPC Ubuntu machine (even though the Windows buildbots give a rather unconventional meaning to the word "stability"). Right now they are mostly green:

http://www.python.org/dev/buildbot/all/waterfall?category=3.x.stable

cheers

Antoine.
Antoine Pitrou <solipsis@pitrou.net> writes:
(even though the Windows buildbots give a rather unconventional meaning to the word "stability").
Nag, nag, nag .... :-)

There's been a bit of an uptick in the past few weeks with hung python_d processes (not a new issue, but it ebbs and flows), so I'm going to try to pull together a monitor script this weekend to start killing them off automatically. Should at least get rid of some of the low hanging fruit that interferes with subsequent builds.

-- David
On Sun, Nov 14, 2010 at 12:40 PM, David Bolen <db3l.net@gmail.com> wrote:
Antoine Pitrou <solipsis@pitrou.net> writes:
(even though the Windows buildbots give a rather unconventional meaning to the word "stability").
Nag, nag, nag .... :-)
There's been a bit of an uptick in the past few weeks with hung python_d processes (not a new issue, but it ebbs and flows), so I'm going to try to pull together a monitor script this weekend to start killing them off automatically. Should at least get rid of some of the low hanging fruit that interferes with subsequent builds.
Do we have any idea why the workaround to avoid the popup windows stopped working? (assuming it ever worked reliably - I thought it did, but that impression may have been incorrect) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan <ncoghlan@gmail.com> writes:
Do we have any idea why the workaround to avoid the popup windows stopped working? (assuming it ever worked reliably - I thought it did, but that impression may have been incorrect)
Oh, the pop-up handling for the RTL dialogs still seems to be working fine (at least I haven't seen any since I put it in place). That, plus the original buildbot tweaks to block any OS popups still looks solid for avoiding any dialogs that block a test process.

This is a completely separate issue, though probably around just as long, and like the popup problem its frequency changes over time. By "hung" here I'm referring to cases where something must go wrong with a test and/or its cleanup such that a python_d process remains running, usually several of them at the same time. So I end up with a bunch of python_d processes in the background (but not with any dialogs pending), which eventually cause errors the next time the same builder is used, since the executable file remains in use.

I expect some of this may be due to the lack of a good process group cleanup under Windows, though the root cause may not be unique to Windows. I see something very similar with reasonable frequency on my OSX Tiger buildbot as well. But since the filesystem there lets the build tree get cleaned and rebuilt even with a stranded executable, the impact on subsequent tests is much smaller than on Windows, though the OSX processes do burn a ton of CPU. I run a script on OSX to kill them off, but that was quick to whip up since in those cases the stranded processes all end up getting owned by init, so it's a simple ps grep and kill. In the Windows case I'll probably just set a time limit, so if the processes have been around for more than a few hours I figure they're safe to kill.

-- David
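For illustration, the age-based cleanup described above might be sketched in Python roughly as follows, assuming the third-party psutil package (the monitor David actually ends up running, later in this thread, is a Cygwin shell script built on the Sysinternals pslist/pskill tools instead):

- - - - - - - - - - - - - - - - - - - - - - - - -
# Rough sketch, not the script from this thread: kill any python_d
# process that has been running for more than two hours, on the theory
# that anything that old is a stranded test process. Requires the
# third-party psutil package; process name assumed to be python_d.exe.
import time
import psutil

MAX_AGE = 2 * 60 * 60   # the "more than a few hours" heuristic, in seconds

def kill_stale_python_d():
    now = time.time()
    for proc in psutil.process_iter(['name', 'create_time']):
        try:
            if (proc.info['name'] == 'python_d.exe'
                    and now - proc.info['create_time'] > MAX_AGE):
                print("killing stale python_d, pid", proc.pid)
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass   # raced with process exit, or not ours to kill

if __name__ == '__main__':
    while True:
        kill_stale_python_d()
        time.sleep(300)   # re-check every five minutes
- - - - - - - - - - - - - - - - - - - - - - - - -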
This is a completely separate issue, though probably around just as long, and like the popup problem its frequency changes over time. By "hung" here I'm referring to cases where something must go wrong with a test and/or its cleanup such that a python_d process remains running, usually several of them at the same time. So I end up with a bunch of python_d processes in the background (but not with any dialogs pending), which eventually cause errors the next time the same builder is used, since the executable file remains in use.
This is what kill_python.exe is supposed to solve. So I recommend investigating why it fails to kill the hanging Pythons.

Regards,
Martin
"Martin v. Löwis" <martin@v.loewis.de> writes:
This is what kill_python.exe is supposed to solve. So I recommend investigating why it fails to kill the hanging Pythons.
Yeah, I know, and I can't say I disagree in principle - not sure why Windows doesn't let the kill in that module work (or if there's an issue actually running it under all conditions). At the moment though, I do know that using the sysinternals pskill utility externally (which is what I currently do interactively) definitely works so to be honest, automating that is a guaranteed bang for buck at this point with no analysis involved. Looking into kill_python or its use can be a follow-on. -- David
On 14 November 2010 02:40, David Bolen <db3l.net@gmail.com> wrote:
There's been a bit of an uptick in the past few weeks with hung python_d processes (not a new issue, but it ebbs and flows), so I'm going to try to pull together a monitor script this weekend to start killing them off automatically. Should at least get rid of some of the low hanging fruit that interferes with subsequent builds.
My buildslave (x86 XP-5, see http://www.python.org/dev/buildbot/buildslaves/moore-windows) runs buildbot as a service. I set it up that way as I assumed that would be the most sensible approach to avoid manual intervention on reboots, keeping a user session permanently running, etc. But it seems that there are a few areas where things don't work quite right when run from a service (see, for example, http://bugs.python.org/issue9931) and I assumed that some of my hung python_d processes were related to that.

Do you run your slave as a service? (And for that matter, what do other Windows slave owners do?) Are there any "best practices" for ongoing admin of a Windows buildslave that might be worth collecting together? (I'll try to put some notes on what I've found together - maybe a page on the Python wiki would be the best place to collect them).

Paul.
Paul Moore <p.f.moore@gmail.com> writes:
Do you run your slave as a service? (And for that matter, what do other Windows slave owners do?) Are there any "best practices" for ongoing admin of a Windows buildslave that might be worth collecting together? (I'll try to put some notes on what I've found together - maybe a page on the Python wiki would be the best place to collect them).
I've always run my slave interactively under Windows (well, started it interactively). I'm not sure if I tried a service in the beginning or not, it was a while ago. So your slave is probably the guinea pig for service operation.

There is http://wiki.python.org/moin/BuildbotOnWindows (for which I can't take any credit). It could probably use a little love and updating, and it's largely aimed at setting things up, not so much at operating it.

I think the only stuff I'm doing on my slave above and beyond the basic setup is a small patch to buildbot (circa 2007, couldn't get it back upstream at the time) to use SetErrorMode to disable OS pop-ups, and the AutoIt script (from earlier this year) to auto-acknowledge C RTL pop-ups. The kill script in this thread, as a safety net above kill_python, would be a third tweak. There was a buildbot fix for uploading that was only needed for the short-lived MSI generation, and which I think later buildbot versions have their own changes for.

I'd be happy to work with you if you're willing to combine/edit our bits of information. Probably something we can take off-list, so just let me know.

-- David
On Sun, Nov 14, 2010 at 02:48, David Bolen <db3l.net@gmail.com> wrote:
Nick Coghlan <ncoghlan@gmail.com> writes:
Do we have any idea why the workaround to avoid the popup windows stopped working? (assuming it ever worked reliably - I thought it did, but that impression may have been incorrect)
Oh, the pop-up handling for the RTL dialogs still seems to be working fine (at least I haven't seen any since I put it in place). That, plus the original buildbot tweaks to block any OS popups still looks solid for avoiding any dialogs that block a test process.
Is the dialog closer script available somewhere? I'm guessing this is the same script that closes the window which pops up during test_capi's crash? I just set up a Windows Server 2008 R2 x64 build slave and noticed it hanging due to the popup.
Brian Curtin <brian.curtin@gmail.com> writes:
Is the dialog closer script available somewhere? I'm guessing this is the same script that closes the window which pops up during test_capi's crash?
Not sure about that specific test, as I won't normally see the windows. If the failure is causing a C RTL pop-up, then yes, the script will be closing it. If the test is generating an OS level pop-up (process error dialog from the OS, not RTL) then that is instead suppressed for any of the child processes run on my slave, so it never shows up at all.

The RTL script is trivial enough that I'll just include it inline:

- - - - - - - - - - - - - - - - - - - - - - - - -
; buildbot.au3
; Forcibly acknowledge any RTL pop-ups that may occur during testing

$MSVCRT = "Microsoft Visual C++ Runtime Library"

while 1
    ; Wait for any RTL pop-up and then acknowledge
    WinWait($MSVCRT)
    ControlClick($MSVCRT, "", "[CLASS:Button; TEXT:OK]")

    ; Safety check to avoid spinning if it doesn't go away
    Sleep(1000)
WEnd
- - - - - - - - - - - - - - - - - - - - - - - - -

Execute with AutoIt3 (http://www.autoitscript.com/autoit3/). I just use the plain autoit3.exe against this script from the Startup folder.

The error mode buildbot patch was discussed in the past on this list (or it might have been the python-3000-devel list at the time). Originally it just used pywin32, but I added a fallback to ctypes if available. When first done, we were still building pre-2.5 builds - I suppose at this point it could just assume the presence of ctypes. The patch below is from 0.7.11p3:

- - - - - - - - - - - - - - - - - - - - - - - - -
--- commands.py 2009-08-13 11:53:17.000000000 -0400
+++ /cygdrive/d/python/2.6/lib/site-packages/buildbot/slave/commands.py 2009-11-08 02:09:38.000000000 -0500
@@ -489,6 +489,23 @@
         if not self.keepStdinOpen:
             self.pp.closeStdin()

+        # [db3l] Under Win32, try to control error mode
+        win32_SetErrorMode = None
+        if runtime.platformType == 'win32':
+            try:
+                import win32api
+                win32_SetErrorMode = win32api.SetErrorMode
+            except:
+                try:
+                    import ctypes
+                    win32_SetErrorMode = ctypes.windll.kernel32.SetErrorMode
+                except:
+                    pass
+
+        if win32_SetErrorMode:
+            log.msg(" Setting Windows error mode")
+            old_err_mode = win32_SetErrorMode(7)
+
         # win32eventreactor's spawnProcess (under twisted <= 2.0.1) returns
         # None, as opposed to all the posixbase-derived reactors (which
         # return the new Process object). This is a nuisance. We can make up
@@ -509,6 +526,10 @@
         if not self.process:
             self.process = p

+        # [db3l]
+        if win32_SetErrorMode:
+            win32_SetErrorMode(old_err_mode)
+
         # connectionMade also closes stdin as long as we're not using a PTY.
         # This is intended to kill off inappropriately interactive commands
         # better than the (long) hung-command timeout. ProcessPTY should be
- - - - - - - - - - - - - - - - - - - - - - - - -

-- David
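As an aside, the otherwise magic-looking 7 passed to SetErrorMode in the patch above is the OR of three SEM_* flags from the Windows API. A minimal ctypes-only sketch of the same call, with the constant names and values taken from MSDN rather than from the patch itself:

- - - - - - - - - - - - - - - - - - - - - - - - -
# Sketch only: the equivalent of the SetErrorMode(7) call above, spelled
# out with named Windows constants (values per MSDN). Windows-only.
import ctypes

SEM_FAILCRITICALERRORS     = 0x0001   # suppress "critical error" dialogs
SEM_NOGPFAULTERRORBOX      = 0x0002   # suppress the GPF/crash dialog
SEM_NOALIGNMENTFAULTEXCEPT = 0x0004   # fix alignment faults silently

flags = (SEM_FAILCRITICALERRORS |
         SEM_NOGPFAULTERRORBOX |
         SEM_NOALIGNMENTFAULTEXCEPT)          # == 7, as in the patch

old_mode = ctypes.windll.kernel32.SetErrorMode(flags)
# ... spawn the child process here; it inherits the error mode ...
ctypes.windll.kernel32.SetErrorMode(old_mode)  # restore the previous mode
- - - - - - - - - - - - - - - - - - - - - - - - -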
Both the Tiger buildbots are suddenly failing 3.x on test_cmd_line. Looking at the changes since the last success, I can't see anything which would obviously affect that... Any suspects?

Here's what's failing:

======================================================================
ERROR: test_run_code (test.test_cmd_line.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_cmd_line.py", line 95, in test_run_code
    assert_python_failure('-c')
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/script_helper.py", line 55, in assert_python_failure
    return _assert_python(False, *args, **env_vars)
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/script_helper.py", line 29, in _assert_python
    env=env)
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/subprocess.py", line 683, in __init__
    self.stdin = io.open(p2cwrite, 'wb', bufsize)
OSError: [Errno 9] Bad file descriptor

======================================================================
ERROR: test_run_module (test.test_cmd_line.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_cmd_line.py", line 72, in test_run_module
    assert_python_failure('-m')
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/script_helper.py", line 55, in assert_python_failure
    return _assert_python(False, *args, **env_vars)
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/script_helper.py", line 29, in _assert_python
    env=env)
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/subprocess.py", line 683, in __init__
    self.stdin = io.open(p2cwrite, 'wb', bufsize)
OSError: [Errno 9] Bad file descriptor

======================================================================
ERROR: test_version (test.test_cmd_line.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/test_cmd_line.py", line 48, in test_version
    rc, out, err = assert_python_ok('-V')
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/script_helper.py", line 48, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/test/script_helper.py", line 29, in _assert_python
    env=env)
  File "/Users/buildbot/buildarea/3.x.parc-tiger-1/build/Lib/subprocess.py", line 683, in __init__
    self.stdin = io.open(p2cwrite, 'wb', bufsize)
OSError: [Errno 9] Bad file descriptor

Bill
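For context, the failing helper boils down to launching a child interpreter with pipes on all three standard streams; the EBADF is raised while Popen wraps its end of the stdin pipe. A simplified sketch of what Lib/test/script_helper.py is doing at that point (not the actual helper code):

- - - - - - - - - - - - - - - - - - - - - - - - -
# Simplified sketch of what _assert_python in script_helper does at the
# point of the failure: run a child interpreter with stdin/stdout/stderr
# all redirected to pipes. The tracebacks above show Popen.__init__
# failing while wrapping the parent's end of the stdin pipe (p2cwrite).
import sys
import subprocess

def run_python(*args):
    proc = subprocess.Popen(
        [sys.executable] + list(args),
        stdin=subprocess.PIPE,      # the p2cread/p2cwrite pipe pair
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

rc, out, err = run_python('-V')
- - - - - - - - - - - - - - - - - - - - - - - - -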
In article <30929.1289879830@parc.com>, Bill Janssen <janssen@parc.com> wrote:
Both the Tiger buildbots are suddenly failing 3.x on test_cmd_line. Looking at the changes since the last success, I can't see anything which would obviously affect that... Any suspects?
It appears to be a duplicate of Issue8458. Playing with it again, it seems to be a race condition: sometimes I see all three failures you reported, sometimes just one, sometimes none. Again, only on 10.4 (Tiger), not 10.5 or 10.6. But the 10.4 machine I'm using is by far the slowest of the three so it is possible that could be a factor. Perhaps a race condition with cleaning up the p2c pipe from a previous run?
-- Ned Deily, nad@acm.org
Ned Deily <nad@acm.org> wrote:
In article <30929.1289879830@parc.com>, Bill Janssen <janssen@parc.com> wrote:
Both the Tiger buildbots are suddenly failing 3.x on test_cmd_line. Looking at the changes since the last success, I can't see anything which would obviously affect that... Any suspects?
It appears to be a duplicate of Issue8458. Playing with it again, it seems to be a race condition: sometimes I see all three failures you reported, sometimes just one, sometimes none. Again, only on 10.4 (Tiger), not 10.5 or 10.6. But the 10.4 machine I'm using is by far the slowest of the three so it is possible that could be a factor.
Good thought. It's also the slowest of my buildbots -- dual 1GHz PPC.
On 14-Nov-10 3:48 AM, David Bolen wrote:
This is a completely separate issue, though probably around just as long, and like the popup problem its frequency changes over time. By "hung" here I'm referring to cases where something must go wrong with a test and/or its cleanup such that a python_d process remains running, usually several of them at the same time.
My guess: the "hung" (single-threaded) Python process has called select() without a timeout in order to wait for some data. However, the data never arrives (due to a broken/failed test), and the select() never returns. On Windows, processes seem harder to kill when they get into this state. If I purposely wedge a Windows process via select() from the interactive interpreter, ctrl-c has absolutely no effect (whereas on Unix, ctrl-c will interrupt the select()).

As for why kill_python.exe doesn't seem to be able to kill said wedged processes, the MSDN documentation on TerminateProcess[1] states the following:

    The terminated process cannot exit until all pending I/O has been
    completed or canceled. (sic)

It's not unreasonable to assume a wedged select() constitutes pending I/O, so that's a possible explanation as to why kill_python.exe isn't able to terminate the processes.

(Also, kill_python currently assumes TerminateProcess() always works; perhaps this optimism is misplaced. Also note the XXX TODO regarding the fact that we don't kill processes that have loaded our python*.dll, but may not be named python_d.exe. I don't think that's the issue here, though.)

On 14-Nov-10 5:32 AM, David Bolen wrote:
"Martin v. Löwis"<martin@v.loewis.de> writes:
This is what kill_python.exe is supposed to solve. So I recommend investigating why it fails to kill the hanging Pythons.
Yeah, I know, and I can't say I disagree in principle - not sure why Windows doesn't let the kill in that module work (or if there's an issue actually running it under all conditions).
At the moment though, I do know that using the sysinternals pskill utility externally (which is what I currently do interactively) definitely works so to be honest,
That's interesting. (That kill_python.exe doesn't kill the wedged processes, but pskill does.) kill_python is pretty simple, it just calls TerminateProcess() after acquiring a handle with the relevant PROCESS_TERMINATE access right. That being said, that's the recommended way to kill a process -- I doubt pskill would be going about it any differently (although, it is sysinternals... you never know what kind of crazy black magic it's doing behind the scenes).

Are you calling pskill with the -t flag? i.e. kill process and all dependents? That might be the ticket, especially if killing the child process that wedged select() is waiting on causes it to return, and thus, makes it killable.

Otherwise, if it happens again, can you try kill_python.exe first, then pskill, and confirm if the former fails but the latter succeeds?

Trent.

[1]: http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx
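For reference, a minimal ctypes sketch of the terminate-by-pid sequence described above; the real kill_python is written in C and also handles identifying which processes to kill, which this omits:

- - - - - - - - - - - - - - - - - - - - - - - - -
# Sketch of the OpenProcess/TerminateProcess sequence described above,
# via ctypes. Illustration only; kill_python.c additionally walks the
# process list to decide *which* pids to terminate.
import ctypes

PROCESS_TERMINATE = 0x0001
kernel32 = ctypes.windll.kernel32

def terminate_pid(pid, exit_code=1):
    handle = kernel32.OpenProcess(PROCESS_TERMINATE, False, pid)
    if not handle:
        raise ctypes.WinError()       # no such pid, or access denied
    try:
        if not kernel32.TerminateProcess(handle, exit_code):
            raise ctypes.WinError()
        # TerminateProcess only *initiates* termination; per the MSDN
        # note quoted above, the process cannot actually exit until its
        # pending I/O completes or is canceled.
    finally:
        kernel32.CloseHandle(handle)
- - - - - - - - - - - - - - - - - - - - - - - - -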
Trent Nelson <trent@snakebite.org> writes:
That's interesting. (That kill_python.exe doesn't kill the wedged processes, but pskill does.) kill_python is pretty simple, it just calls TerminateProcess() after acquiring a handle with the relevant PROCESS_TERMINATE access right. (...)
Are you calling pskill with the -t flag? i.e. kill process and all dependents? That might be the ticket, especially if killing the child process that wedged select() is waiting on causes it to return, and thus, makes it killable.
Nope, just "pskill python_d". Haven't bothered to check the pskill source but I'm assuming it's just a basic TerminateProcess. Ideally my quickest workaround would just be to replace the kill_python in the buildbot tools script with that command but of course they could get updated on checkouts and I'm not arguing it's generally appropriate enough to belong in the source. I suspect the problem may be on the "identify which process to kill" rather than the "kill it" part, but it's definitely going to take time to figure that out for sure. While the approach kill_python takes is much more appropriate, since we don't currently have multiple builds running simultaneously (and for me the machines are dedicated as build slaves, so I won't be having my own python_d), a more blanket kill operation is safe enough.
Otherwise, if it happens again, can you try kill_python.exe first, then pskill, and confirm if the former fails but the latter succeeds?
Yeah, I've got a temporary tree with a built binary around, but I still have to make sure of the right way to run it manually so that it does the identification right (which I think also means I need to figure out from which build tree the hung process started). Up until now, typically when I've found a hung setup, the rest of the build tree which originally applied to that process has been cleaned.

I definitely sympathize with Martin's position though - it wasn't the simplest tool to write (and I still have some email from him about the week+ it took just to test the process identification part remotely through buildbots at the time), so I regret not jumping right in to try to fix it. But it's just way more effort than typing "pskill python_d", at least with my current availability.

-- David
I previously wrote:
I suspect the problem may be in the "identify which process to kill" part rather than the "kill it" part, but it's definitely going to take time to figure that out for sure. While the approach kill_python takes is much more appropriate, since we don't currently have multiple builds running simultaneously (and for me the machines are dedicated as build slaves, so I won't be having my own python_d), a more blanket kill operation is safe enough.
For anyone interested, I caught (well, Georg Brandl caught it first) a case on Saturday with some hung processes on the Win7 buildbot that I was able to verify kill_python failed to kill. This was after having a few instances where it did work fine. I've created issue 10641 to track it. I also noticed another recent issue (10136) related to kill_python missing processes, but given that kill_python works in my case some of the time, and is always called via the same path, I'm not sure that can be my problem.

I also realized that some fraction of the other cases I have seen might not have truly been an issue, since from what I can see kill_python is only run at the start of a build process, so hung processes (even killable ones) from a prior build hang around until the next build takes place. They can, however, interfere with the svn checkout, so things never get to the point of using kill_python. So maybe kill_python's use should be moved to the clean stage?

-- David
On 12/06/2010 07:24 PM, David Bolen wrote:
I also realized that some fraction of the other cases I have seen might not have truly been an issue, since from what I can see kill_python is only run at the start of a build process, so hung processes (even killable ones) from a prior build hang around until the next build takes place. They can, however, interfere with the svn checkout, so things never get to the point of using kill_python. So maybe kill_python's use should be moved to the clean stage?
Maybe belt-and-suspenders it in both places.

Tres.
--
Tres Seaver          +1 540-429-0999          tseaver@palladion.com
Palladion Software   "Excellence by Design"   http://palladion.com
Tres Seaver <tseaver@palladion.com> writes:
Maybe belt-and-suspenders it in both places.
The clean batch file is also called from the build step, so relocating it there should maintain the existing behavior as well. Hirokazu (ocean-city) pointed out in my new issue an earlier issue he created (#9973) that included a patch for such a change.

In thinking about it some more, I suppose there's still a small window for a loss of communication during a test which results in clean not being run (which could then block the next svn checkout without an opportunity to kill the processes), so maybe the right place is actually at the end of the test batch file, which is the step during which such hung processes might get generated? I don't know the history, if any, of its current location in the flow.

Relocating the use of kill_python seems reasonable to me, after which we could sort of wait and see if we run into any other hangs that interfere with the builds. I can't make such a change myself though.

-- David
On 2010/12/08 0:11, David Bolen wrote:
In thinking about it some more, I suppose there's still a small window for a loss of communication during a test which results in clean not being run (which could then block the next svn checkout without an opportunity to kill the processes), so maybe the right place is actually at the end of the test batch file, which is the step during which such hung processes might get generated? I don't know the history, if any, of its current location in the flow.
Yes, but the test run can freeze. In that case, I'm worried about this (snip) in test.bat:

    rt.bat ....                    # freezes here (will be halted by buildbot)
    vcbuild .... & kill_python_d   # will this be called?
Hirokazu Yamamoto <ocean-city@m2.ccsnet.ne.jp> writes:
Yes, but the test run can freeze. In that case, I'm worried about this (snip) in test.bat:

    rt.bat ....                    # freezes here (will be halted by buildbot)
    vcbuild .... & kill_python_d   # will this be called?
Yeah, you're right. It may be impossible to completely eliminate the risk of a process getting stuck without doing something external to the build process, so long as the first thing a new build tries to do is an svn checkout that will fail if the process is still running. Having the kill operation in clean.bat covers the vast majority of the common cases with a minimum change, so it seems the simplest approach.

-- David
On 23 November 2010 23:18, David Bolen <db3l.net@gmail.com> wrote:
Trent Nelson <trent@snakebite.org> writes:
That's interesting. (That kill_python.exe doesn't kill the wedged processes, but pskill does.) kill_python is pretty simple, it just calls TerminateProcess() after acquiring a handle with the relevant PROCESS_TERMINATE access right. (...)
Are you calling pskill with the -t flag? i.e. kill process and all dependents? That might be the ticket, especially if killing the child process that wedged select() is waiting on causes it to return, and thus, makes it killable.
Nope, just "pskill python_d". Haven't bothered to check the pskill source but I'm assuming it's just a basic TerminateProcess. Ideally my quickest workaround would just be to replace the kill_python in the buildbot tools script with that command but of course they could get updated on checkouts and I'm not arguing it's generally appropriate enough to belong in the source.
After a long, long time (:-(), I'm finally getting a chance to look at this. I've patched buildbot as mentioned earlier in the thread, but I don't see where I should put the pskill command to make it work.

At the moment, I have scheduled tasks to pskill python_d and vsjitdebugger. The python_d one runs daily and the debugger one hourly. (I daren't kill python_d too often, or I'll start killing in-progress tests, I assume). The vsjitdebugger one is there because I think it solves the CRT popup issue (I'll add the AutoIt script as well, but as I'm running as a service, I'm not sure the popup will always be visible for the AutoIt script to pick up...)

Presumably, you're inserting a pskill command somewhere into the actual build process. I don't know much about buildbot, but I thought that was controlled by the master and/or the Python build scripts, neither of which I can change. If I want to add a pskill command just after a build/test has run (i.e., about where kill_python runs at the moment) how do I do that?

Thanks,
Paul.
Paul Moore <p.f.moore@gmail.com> writes:
Presumably, you're inserting a pskill command somewhere into the actual build process. I don't know much about buildbot, but I thought that was controlled by the master and/or the Python build scripts, neither of which I can change.
If I want to add a pskill command just after a build/test has run (i.e., about where kill_python runs at the moment) how do I do that?
I haven't been able to - as you say there's no good way to hook into the build process in real time as the changes have to be external or they'll get zapped on the next checkout. I suppose you could rapidly try to monitor the output of the build slave log file, but then you risk killing a process from a next step if you miss something or are too slow. And I've had cases (after long periods of continuous runtime) where the build slave log stops being generated even while the slave is running fine.

Anyway, in the absence of changes to the build tree, I finally gave up and now run an external script (see below) that whacks any python_d process it finds running for more than 2 hours (arbitrary choice). I considered trying to dig deeper to identify processes with no logical test parent (more similar to the build's kill_python itself), but decided it was too much effort for the minimal extra gain. So it's not terribly different from your once-a-day pskill, though as you say, if you arbitrarily kill all python_d processes at any given point in time, you risk interrupting an active test. So the AutoIt script covers pop-ups and the script below cleans up hung processes.

On the subject of pop-ups, I'm not sure, but if you find your service not showing them, try enabling the "Allow service to interact with the desktop" option in the service definition. In my experience, though, if a service can't perform a UI interaction, the interaction just fails, so I wouldn't expect the process to get stuck in that case.

Anyway, in my case the kill script itself is Cygwin/bash based, using the Sysinternals PsTools utilities, and conceptually it just kills (pskill) any python_d process identified as having been running for 2 or more hours of wall time (via pslist):

- - - - - - - - - - - - - - - - - - - - - - - - -
#!/bin/sh
#
# kill_python.sh
#
# Quick 'n dirty script to watch for python_d processes that exceed a few
# hours of runtime, then kill them, assuming they're hung
#

PROC="python_d"
TIMEOUT="2"

while [ 1 ]; do
    echo "`date` Checking..."
    PIDS=`pslist 2>&1 | grep "^$PROC" | awk -v TIMEOUT=$TIMEOUT '{split($NF,fields,":"); if (int(fields[1]) >= int(TIMEOUT)) {print $2}}'`
    if [ "$PIDS" ]; then
        echo ===== `date`
        for pid in $PIDS; do
            pslist $pid 2>&1 | grep "^$PROC"
            pskill $pid
        done
        echo =====
    fi
    sleep 300
done
- - - - - - - - - - - - - - - - - - - - - - - - -

It's a kludge, but as you say, anything we impose on the build slave side has to live outside of the build tree. I've been running it for about a month now and it seems to be doing the job. I run a similar script on OSX (my Tiger slave also sometimes sees stuck processes, though they just burn CPU rather than interfere with tests), but there I can identify stranded python_d processes because they end up owned by init, so the script can react more quickly.

I'm pretty sure the best long term fix is to move the kill processing into the clean script (as per issue 9973) rather than where it currently is in the build script, but so far I don't think the idea has been able to attract the interest of anyone who can actually commit such a change. (See also the Dec continuation of this thread - http://www.mail-archive.com/python-dev@python.org/msg54389.html) I had also created issue 10641 from when I thought I'd found a problem with kill_python, but that turned out to be incorrect, and in subsequent tests kill_python in the build tree always worked, so the core issue seems to be the failure to run it at all, as opposed to it not working.
For now though, these two external "monitors" seem to have helped contain the number of manual operations I have to do on my two Windows slaves. (Though recently I've begun seeing two new sorts of pop-ups under Windows 7 but both related to memory, so I think I just need to give my VM a little more memory) -- David
Hello,
I'm pretty sure the best long term fix is to move the kill processing into the clean script (as per issue 9973) rather than where it currently is in the build script, but so far I don't think the idea has been able to attract the interest of anyone who can actually commit such a change.
Thanks for bringing this to my attention. I've added a comment on that issue. If you say this should improve things, there's probably no reason not to commit such a patch.

Regards

Antoine.
On 30 January 2011 20:50, David Bolen <db3l.net@gmail.com> wrote:
I haven't been able to - as you say there's no good way to hook into the build process in real time as the changes have to be external or they'll get zapped on the next checkout. I suppose you could rapidly try to monitor the output of the build slave log file, but then you risk killing a process from a next step if you miss something or are too slow. And I've had cases (after long periods of continuous runtime) where the build slave log stops being generated even while the slave is running fine.
OK, sounds like I hadn't missed anything, then, which is good in some sense :-)
For now though, these two external "monitors" seem to have helped contain the number of manual operations I have to do on my two Windows slaves. (Though recently I've begun seeing two new sorts of pop-ups under Windows 7 but both related to memory, so I think I just need to give my VM a little more memory)
Yes, my (somewhat more simplistic) kill scripts had done some good as well. Having said that, http://bugs.python.org/issue9931 is currently stopping my buildslave (at least if I run it as a service), so it's a bit of a moot point at the moment... (One thing that might be good is if there were a means in the buildslave architecture to deliberately disable a test temporarily, if it's known to fail - I know ignoring errors isn't a good thing in general, but OTOH, having a slave effectively dead for months because of a known issue isn't a lot of help, either :-() Thanks for the reply. Paul.
participants (11)
- "Martin v. Löwis"
- Antoine Pitrou
- Bill Janssen
- Brian Curtin
- David Bolen
- Hirokazu Yamamoto
- Ned Deily
- Nick Coghlan
- Paul Moore
- Trent Nelson
- Tres Seaver