Re: [Python-Dev] test_fork1 on SMP? (was Re: [Python Dev] test_fork1 failing --with-threads (for some people)...)

This message duplicates some of my previous messages about the thread/fork deadlock. (I received a report that some of my earlier messages appear to have been lost.) I'm running a simpler version of Neil's test program that uses the thread module instead of the threading module; it's attached at the end of this message (test_fork2.py). It spawns five threads; then each thread, including the main thread, goes into a "while 1:" loop that forks. The program locks up after about 100-200 forks. The process structure looks like this:
top reports the following CPU activity:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
  787 jeremy    19   0  2048 2048  1940 R       0 52.8  0.7   0:06 python
  788 jeremy    17   0  2048 2048  1940 R       0 44.2  0.7   0:06 python

If I attach to process 787 with gdb, this is the stack trace I see:

#0  pthread_cond_wait (cond=0x80f297c, mutex=0x80f2988) at queue.h:25
#1  0x806fee3 in PyThread_acquire_lock (lock=0x80f2978, waitflag=1) at ../../Python/thread_pthread.h:311
#2  0x8054ec1 in PyEval_RestoreThread (tstate=0x80fb2f0) at ../../Python/ceval.c:171
#3  0x80c1242 in posix_wait (self=0x0, args=0x80f1cac) at ../../Modules/posixmodule.c:2849
#4  0x8059b83 in call_builtin (func=0x8103578, arg=0x80f1cac, kw=0x0) at ../../Python/ceval.c:2369
#5  0x8059a41 in PyEval_CallObjectWithKeywords (func=0x8103578, arg=0x80f1cac, kw=0x0) at ../../Python/ceval.c:2337
#6  0x80583b1 in eval_code2 (co=0x8144ec0, globals=0x80f25d4, locals=0x0, args=0x80f1cb8, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at ../../Python/ceval.c:1675
#7  0x805a01d in call_function (func=0x8144cdc, arg=0x80f1cac, kw=0x0) at ../../Python/ceval.c:2491
#8  0x8059a21 in PyEval_CallObjectWithKeywords (func=0x8144cdc, arg=0x80f1cac, kw=0x0) at ../../Python/ceval.c:2335
#9  0x80b03aa in t_bootstrap (boot_raw=0x8142ad8) at ../../Modules/threadmodule.c:193
#10 0x4001ceca in pthread_start_thread (arg=0xbf3ffe60) at manager.c:213

After I collect the stack trace and restart the program, several other processes start running again:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
  786 jeremy    14   0  2048 2048  1928 R       0 24.6  0.7   0:09 python
  788 jeremy    15   0  2048 2048  1928 R       0 24.6  0.7   0:34 python
  787 jeremy    17   0  2048 2048  1928 R       0 24.4  0.7   0:34 python
  783 jeremy    15   0  2048 2048  1928 R       0 23.8  0.7   0:09 python

If I run the process again with thread debugging enabled (Py_DEBUG defined at compile time and THREADDEBUG set to 1), I get the following trace:

PyThread_init_thread called
PyThread_allocate_lock() -> 0x80f2010
PyThread_acquire_lock(0x80f2010, 1) called
PyThread_acquire_lock(0x80f2010, 1) -> 1
PyThread_release_lock(0x80f2010) called
PyThread_allocate_lock called
PyThread_allocate_lock() -> 0x80f6c58
PyThread_acquire_lock(0x80f6c58, 0) called
PyThread_acquire_lock(0x80f6c58, 0) -> 1
PyThread_release_lock(0x80f6c58) called
PyThread_acquire_lock(0x80f6c58, 0) called
PyThread_acquire_lock(0x80f6c58, 0) -> 1
PyThread_release_lock(0x80f6c58) called
PyThread_acquire_lock(0x80f6c58, 0) called
PyThread_acquire_lock(0x80f6c58, 0) -> 1
Adding parser accelerators ...
Done.
PyThread_release_lock(0x80f6c58) called
PyThread_acquire_lock(0x80f6c58, 0) called
PyThread_acquire_lock(0x80f6c58, 0) -> 1
PyThread_release_lock(0x80f6c58) called
PyThread_acquire_lock(0x80f6c58, 0) called
PyThread_acquire_lock(0x80f6c58, 0) -> 1
PyThread_release_lock(0x80f6c58) called
PyThread_allocate_lock called
PyThread_allocate_lock() -> 0x80f6498
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) -> 1
PyThread_start_new_thread called
[lots of stuff omitted]
2051 2127
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_release_lock(0x80f6498) called
PyThread_acquire_lock(0x80f6498, 1) -> 1
PyThread_release_lock(0x80f6498) called
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) -> 1
PyThread_release_lock(0x80f6498) called
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) -> 1
1024 2130
PyThread_release_lock(0x80f6498) called
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) -> 1
PyThread_acquire_lock(0x80f6498, 1) called
PyThread_acquire_lock(0x80f6498, 1) -> 1
PyThread_release_lock(0x80f6498) called
[blocks]

In this case, the deadlock occurs much more quickly. (Perhaps the extra I/O produced by the debugging output exacerbates the problem.)
Jeremy

PS If you don't have access to a Linux machine, you can use one of the Linux test machines on the SF compile farm. Log into linux.compile.sourceforge.net.

import os
import thread

def f():
    while 1:
        if os.fork() == 0:
            print "%s %s" % (thread.get_ident(), os.getpid())
            os._exit(0)
        os.wait()

thread.start_new_thread(f, ())
thread.start_new_thread(f, ())
thread.start_new_thread(f, ())
thread.start_new_thread(f, ())
thread.start_new_thread(f, ())
f()
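[Editorial note: for anyone reproducing this today, here is a minimal sketch of the attached script in modern Python 3, whose threading module replaces the old thread module. The bounded fork count, the main() wrapper, and the use of os.waitpid/os.write are deviations from the original, chosen so the demo terminates and the child avoids the stdio buffer lock.]

```python
import os
import threading

FORKS_PER_THREAD = 5  # bounded, unlike the original "while 1:" loop

def f():
    for _ in range(FORKS_PER_THREAD):
        pid = os.fork()
        if pid == 0:
            # os.write is a raw syscall: it avoids the stdio buffer lock,
            # which another thread may have held at the instant of the fork.
            os.write(1, b"%d %d\n" % (threading.get_ident(), os.getpid()))
            os._exit(0)
        os.waitpid(pid, 0)  # reap this thread's own child

def main():
    workers = [threading.Thread(target=f) for _ in range(5)]
    for t in workers:
        t.start()
    f()  # the main thread forks too, as in the original
    for t in workers:
        t.join()

if __name__ == "__main__":
    main()
    print("done")
```

On a current CPython (which reinitializes the GIL in the child after fork) this should run to completion; on the 2.0-era interpreter under discussion it would deadlock as described above.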

[Jeremy Hylton, with an excellent writeup of what he's found out about his test_fork2.py test]

It's a baffler! AFAIK, nobody yet has thought of a way that a fork can screw up the state of the locks in the *parent* process (it must be easy to see how they can get screwed up in a child, because two of us already did <wink>). But your hangs appear to be in parents, as the gdb stacktrace shows that a parent process is (presumably) hanging while trying to reacquire the global lock in the expansion of the Py_END_ALLOW_THREADS macro right after the

    pid = wait(&status);

in posix_wait returns (#2 is from the macro expansion):

#0 pthread_cond_wait (cond=0x80f297c, mutex=0x80f2988) at queue.h:25
#1 0x806fee3 in PyThread_acquire_lock (lock=0x80f2978, waitflag=1) at ../../Python/thread_pthread.h:311
#2 0x8054ec1 in PyEval_RestoreThread (tstate=0x80fb2f0) at ../../Python/ceval.c:171
#3 0x80c1242 in posix_wait (self=0x0, args=0x80f1cac) at ../../Modules/posixmodule.c:2849

However, the stacktrace shows that the parent in question is executing pthread_cond_wait, and this is the worst implementation of pthread_cond_wait in existence if it's using a busy loop to wait for the condition variable to get signaled! If this process keeps accumulating cycles, it either *is* the worst implementation in existence, or the process isn't actually hung and you just happened to capture the stack at a misleading time. The debug output may be misleading too:
... [lots of stuff omitted]
Ouch -- wish you hadn't. Debug output is likely to come out-of-order, so without *all* of it we can't see exactly how far out of balance the acquire and release calls got.
It's probably *not* blocking on the release: in this bunch of output, there are 7 calls to acquire, 5 calls to release, and 5 returns from acquire. That the # of release calls == the # of acquire returns suggests this output is "good enough", and if so, that leaves two acquire calls waiting forever.

Don't know what to do now. Obvious suggestions include putting many prints into test_fork2 so we get a better idea of what's happening, or going back to Trent's print-filled variant of test_fork1 (which I still think is an easier one to study, as it forks only once instead of unboundedly often).

BTW, in *my* day, the answer to the question "what happens to a threaded program at a fork?" was to walk away laughing -- and I'm not sure the reality of that has actually changed, regardless of how many more words they may have added to the pthreads docs in the meantime <0.7 wink>.

silly-humans-ly y'rs - tim
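[Editorial note: Tim's hand tally of acquires and releases can be automated. A minimal sketch in modern Python, assuming the THREADDEBUG line formats shown in Jeremy's trace; lock_balance is a hypothetical helper name, not part of CPython.]

```python
import re

# Matches the three THREADDEBUG line shapes seen in the trace:
#   PyThread_acquire_lock(0x..., 1) called
#   PyThread_acquire_lock(0x..., 1) -> 1
#   PyThread_release_lock(0x...) called
_LINE = re.compile(r"PyThread_(acquire|release)_lock"
                   r"\((0x[0-9a-f]+)(?:, \d)?\) (called|-> 1)")

def lock_balance(log_text):
    """Tally acquire calls, acquire returns, and release calls per lock."""
    counts = {}
    for line in log_text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue
        op, addr, what = m.groups()
        c = counts.setdefault(addr, {"acquire_call": 0,
                                     "acquire_return": 0,
                                     "release_call": 0})
        if op == "release":
            c["release_call"] += 1
        elif what == "called":
            c["acquire_call"] += 1
        else:
            c["acquire_return"] += 1
    return counts
```

Run over the final chunk of Jeremy's trace, it should reproduce the 7 acquire calls / 5 releases / 5 acquire returns that Tim counted by hand, with the two unreturned acquires being the blocked threads.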

On Mon, Jul 31, 2000 at 04:42:50AM -0400, Tim Peters wrote:
If I add Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS around fork() in posixmodule then the child is the process which always seems to hang. The child is hanging at:

#0  0x4006d58b in __sigsuspend (set=0xbf7ffac4) at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1  0x4001f1a0 in pthread_cond_wait (cond=0x8264e1c, mutex=0x8264e28) at restart.h:49
#2  0x806f3c3 in PyThread_acquire_lock (lock=0x8264e18, waitflag=1) at thread_pthread.h:311
#3  0x80564a8 in PyEval_RestoreThread (tstate=0x8265a78) at ceval.c:178
#4  0x80bf274 in posix_fork (self=0x0, args=0x8226ccc) at ./posixmodule.c:1659
#5  0x8059460 in call_builtin (func=0x82380e0, arg=0x8226ccc, kw=0x0) at ceval.c:2376
#6  0x8059378 in PyEval_CallObjectWithKeywords (func=0x82380e0, arg=0x8226ccc, kw=0x0) at ceval.c:2344
#7  0x80584f2 in eval_code2 (co=0x8265e98, globals=0x822755c, locals=0x0, args=0x8226cd8, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x0) at ceval.c:1682
#8  0x805974b in call_function (func=0x8264ddc, arg=0x8226ccc, kw=0x0) at ceval.c:2498
#9  0x805936b in PyEval_CallObjectWithKeywords (func=0x8264ddc, arg=0x8226ccc, kw=0x0) at ceval.c:2342
#10 0x80af26a in t_bootstrap (boot_raw=0x8264e00) at ./threadmodule.c:199
#11 0x4001feca in pthread_start_thread (arg=0xbf7ffe60) at manager.c:213

Since there is only one thread in the child, this should not be happening. Can someone explain this? I have tested this on both an SMP Linux machine and a UP Linux machine.

Neil
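[Editorial note: Neil's premise is what POSIX fork() promises: only the calling thread is replicated into the child. A minimal sketch, assuming Python 3 (threads_seen_in_child is a hypothetical helper name), that passes the child's thread count back through its exit status.]

```python
import os
import threading

def threads_seen_in_child(extra_threads=3):
    """Fork while several threads are alive; return the thread count the
    child observes, passed back via its exit status."""
    stop = threading.Event()
    workers = [threading.Thread(target=stop.wait)
               for _ in range(extra_threads)]
    for t in workers:
        t.start()
    pid = os.fork()
    if pid == 0:
        # POSIX fork() copies only the calling thread into the child.
        os._exit(threading.active_count())
    _, status = os.waitpid(pid, 0)
    stop.set()
    for t in workers:
        t.join()
    return os.WEXITSTATUS(status)
```

The child really does contain one thread; the hang Neil sees comes from lock *state* copied at fork time, not from surviving threads, as Guido explains below.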

If I add Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS around fork() in posixmodule then the child is the process which always seems to hang.
I first thought that the lock should be released around the fork too, but later I realized that that was exactly wrong: if you release the lock before you fork, another thread will likely grab the lock before the fork happens; then in the child the lock is held by that other thread, but that thread doesn't exist, so when the main thread tries to get the lock back it hangs in Py_END_ALLOW_THREADS.

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
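[Editorial note: Guido's failure mode can be demonstrated in miniature. A minimal sketch, assuming Python 3, in which a helper thread holds a lock across the fork and the child probes it with a timed acquire so the demo itself cannot hang; child_sees_lock_held is a hypothetical name.]

```python
import os
import threading
import time

def child_sees_lock_held():
    """Hold a lock in one thread across a fork; report whether the child
    finds the lock still held even though the holder is gone."""
    lock = threading.Lock()
    ready = threading.Event()

    def holder():
        with lock:
            ready.set()
            time.sleep(1.0)  # keep the lock held while the parent forks

    t = threading.Thread(target=holder)
    t.start()
    ready.wait()  # don't fork until the lock is actually taken
    pid = os.fork()
    if pid == 0:
        # The holder thread does not exist in the child, yet the copied
        # lock is still marked held: a blocking acquire() would hang forever.
        held = not lock.acquire(timeout=0.2)
        os._exit(1 if held else 0)
    _, status = os.waitpid(pid, 0)
    t.join()
    return os.WEXITSTATUS(status) == 1
```

This is exactly why keeping the lock held across fork() is the safer choice: the forking thread then owns the lock in the child and can release it itself.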

participants (4)
- Guido van Rossum
- Jeremy Hylton
- Neil Schemenauer
- Tim Peters