bug in python arm-linux?: start_new_thread fails after popen

Hi,

I have a python program which works fine on x86 but doesn't work on any of my arm-linux devices (Zaurus, Ipaq, SIMpad) - all of them are running Python 2.3.2 on top of glibc 2.3.2+linuxthreads on top of kernel 2.4.18-rmk6-pxa3 or kernel 2.4.19-rmk7, respectively. After a long week of debugging I can now reproduce the behaviour with the following minimal case:

----------------------------------------------------------------------
import thread
from time import sleep
import sys
import os

def createPipe():
    print os.popen( "ls -l" ).read()

def threadMain( name ):
    while True:
        sys.stderr.write( name+": i'm still running" )
        sys.stderr.flush()
        sleep( 1 )

if __name__ == '__main__':
    createPipe()
    print "BEFORE start_new_thread"
    thread.start_new_thread( threadMain, ( "1", ) )
    print "AFTER start_new_thread"
    createPipe()
    print "BEFORE start_new_thread"
    thread.start_new_thread( threadMain, ( "2", ) )
    print "AFTER start_new_thread"
    sleep( 5 )
-----------------------------------------------------------------------

This program - as is - just hangs in the first start_new_thread() and never comes back. If you comment out the first call to createPipe(), then it works fine. In the first - hanging - case, an strace shows:

-------------------------------------------------------------------------
write(1, "BEFORE start_new_thread\n", 24BEFORE start_new_thread
) = 24
rt_sigprocmask(SIG_BLOCK, ~[33], [RTMIN], 8) = 0
pipe([3, 4]) = 0
clone(child_stack=0x14e080, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND) = 949
write(4, "\314V\1@\5\0\0\0\0\0\0\0\0\0\0\0\34\213\37@\10c\1@\0\0"..., 148) = 148
rt_sigprocmask(SIG_SETMASK, NULL, ~[KILL STOP 33], 8) = 0
write(4, "\0\21\3@\0\0\0\0\0\0\0\0hO\n\0\2503\23\0\377\376\373\377"..., 148) = 148
rt_sigprocmask(SIG_SETMASK, NULL, ~[KILL STOP 33], 8) = 0
rt_sigsuspend(~[KILL STOP RTMIN 33]
--------------------------------------------------------------------------

The program must be kill -9'ed at this point.

Can anyone explain this behaviour to me - did I find a bug?

Sincerely,

--
:M:
--------------------------------------------------------------------------
Dipl.-Inf. Michael 'Mickey' Lauer
mickey@tm.informatik.uni-frankfurt.de
Raum 10b - ++49 69 798 28358
Fachbereich Informatik und Biologie
--------------------------------------------------------------------------

On Wed, Dec 17, 2003 at 03:01:00PM +0100, Michael Lauer wrote:
Hi, I have a python program which works fine on x86 but doesn't work on any of my arm-linux devices (Zaurus, Ipaq, SIMpad) - all of them are running Python 2.3.2 on top of glibc 2.3.2+linuxthreads on top of kernel 2.4.18-rmk6-pxa3 or kernel 2.4.19-rmk7, respectively.
Using threads and fork together seems to be a big smelly armpit in Python. There are also problems on redhat9, where signals in a fork+exec'd subprocess are blocked, for instance. This seemed to be a consequence of blocking all signals in thread_pthread.h:PyThread_start_new_thread(). Perhaps pthread_atfork() could be used to fix this problem, though I know next to nothing about pthreads beyond the documentation I glanced at back when I first became aware of my thread+fork problem and before writing this message. Jeff

On Wed, Dec 17, 2003 at 09:14:50AM -0600, Jeff Epler wrote:
There are also problems on redhat9, where signals in a fork+exec'd subprocess are blocked, for instance. This seemed to be a consequence of blocking all signals in thread_pthread.h:PyThread_start_new_thread().
To follow up on my own message, here is a program that demonstrates the problem I ran into with threads and signals. doit() uses system() to create a process that should kill itself with a signal nearly instantly, otherwise it sleeps for one second. If doit() is called from the main thread or before any threads are created, it prints a time well under 1 second. If it is run from a thread, it takes just over a second, because the delivery of the signal in the subprocess is blocked.

Typical output:

$ python thread-signal-problem.py
not threaded
Elapsed time 0.00420594215393
subthread
Elapsed time 1.00974297523
main thread
Elapsed time 0.00419104099274

import thread, time, os

def doit():
    t = time.time()
    os.system("kill $$; sleep 1")
    t1 = time.time()
    print "Elapsed time", t1-t

print "not threaded"
doit()
print "subthread"
thread.start_new(doit, ())
time.sleep(2)
print "main thread"
doit()
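For readers on a current interpreter, the effect can be approximated with a small Python 3 sketch. It is not Jeff's program: the interpreter running it may not block signals in subthreads by itself, so the worker thread blocks SIGTERM explicitly as a stand-in, and the shell command reports the outcome through its exit status instead of through elapsed time. The names COMMAND, run_in_thread and describe are made up for this illustration.

```python
import os
import signal
import threading

# The shell sends itself SIGTERM; it only reaches "exit 7" if it survived.
COMMAND = "kill $$; exit 7"


def run_in_thread(block_sigterm):
    """Run COMMAND via os.system() from a subthread; return the raw wait status."""
    result = []

    def worker():
        if block_sigterm:
            # Assumption: stands in for an interpreter that blocks
            # signals in every thread except the main one.
            signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGTERM])
        result.append(os.system(COMMAND))

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result[0]


def describe(status):
    """Turn a wait status into a short human-readable string."""
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "exited with %d" % os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("SIGTERM blocked in thread:", describe(run_in_thread(True)))
    print("SIGTERM not blocked:      ", describe(run_in_thread(False)))
```

With the signal blocked in the calling thread, the shell is expected to survive its own kill, which corresponds to the one-second case in the output above.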

Jeff Epler <jepler@unpythonic.net> writes:
To follow up on my own message, here is a program that demonstrates the problem I ran into with threads and signals.
Can you find out what $$ is, and what the PIDs and thread IDs of all participating threads are? Regards, Martin

On Thu, Dec 18, 2003 at 10:22:34PM +0100, Martin v. Löwis wrote:
Can you find out what $$ is, and what the PIDs and thread IDs of all participating threads are?
I'm not sure what all information I should try to gather for you. Let me know if you think this is enough to file a bug report with...

I changed the example to make it clearer that it's the subprocess ignoring the signal that is the problem, not anything in Python that is taking time to notice the death of a child process.

Typical output (of course, pids and ppids change from run to run. I didn't find gettid(), I think thread.get_ident() is not what you were asking for):

not threaded
os.getpid() -> 6444
os.getppid() -> 6332
thread.get_ident() -> 1074152064
shell process 6445
thread ppid 6444
Elapsed time 0.00640296936035

subthread
os.getpid() -> 6444
os.getppid() -> 6332
thread.get_ident() -> 1082547504
shell process 6447
thread ppid 6444
kill failed
Elapsed time 1.01621508598

main thread
os.getpid() -> 6444
os.getppid() -> 6332
thread.get_ident() -> 1074152064
shell process 6449
thread ppid 6444
Elapsed time 0.00641894340515

:r threadbug.py
import thread, time, os

def doit():
    print "os.getpid() ->", os.getpid()
    print "os.getppid() ->", os.getppid()
    print "thread.get_ident() ->", thread.get_ident()
    t = time.time()
    os.system("echo shell process $$; echo thread ppid $PPID; kill $$; echo kill failed; sleep 1")
    t1 = time.time()
    print "Elapsed time", t1-t
    print
    print

print "not threaded"
doit()
print "subthread"
thread.start_new(doit, ())
time.sleep(2)
print "main thread"
doit()

Jeff Epler <jepler@unpythonic.net> writes:
Can you find out what $$ is, and what the PIDs and thread IDs of all participating threads are?
I'm not sure what all information I should try to gather for you. Let me know if you think this is enough to file a bug report with... I changed the example to make it clearer that it's the subprocess ignoring the signal that is the problem, not anything in Python that is taking time to notice the death of a child process.
That is an important observation; signals that are blocked in the parent process will be blocked in the child process as well. I'm not sure what to do about this: We apparently *want* the signals blocked in the thread, but we don't want them to be blocked in the process invoked through system(). Proposals are welcome. Regards, Martin
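The inheritance Martin describes can be checked with a short Python 3 sketch, assuming a POSIX system where sys.executable points at a usable interpreter. It blocks SIGUSR1 in the calling thread (the main thread here, for simplicity), then forks and execs a fresh interpreter that reports through its exit status whether it started with SIGUSR1 blocked. CHILD_CODE, child_sees_sigusr1_blocked and demo are names invented for the example.

```python
import os
import signal
import sys

# Run by the exec'd child: exit status 1 if SIGUSR1 is blocked, else 0.
CHILD_CODE = (
    "import signal, sys; "
    "sys.exit(int(signal.SIGUSR1 in "
    "signal.pthread_sigmask(signal.SIG_BLOCK, [])))"
)


def child_sees_sigusr1_blocked():
    """Fork and exec a new interpreter; report whether SIGUSR1 is blocked in it."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable, [sys.executable, "-c", CHILD_CODE])
        finally:
            os._exit(127)  # only reached if exec failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1


def demo():
    """Return (blocked in child before, blocked in child while parent blocks)."""
    before = child_sees_sigusr1_blocked()
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGUSR1])
    try:
        during = child_sees_sigusr1_blocked()
    finally:
        # Put the caller's mask back the way it was.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return before, during


if __name__ == "__main__":
    print(demo())
```

The mask survives both fork() and exec(), which is why a thread that runs with signals blocked passes that state on to anything it launches.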

martin@v.loewis.de (Martin v. Löwis) writes:
Jeff Epler <jepler@unpythonic.net> writes:
Can you find out what $$ is, and what the PIDs and thread IDs of all participating threads are?
I'm not sure what all information I should try to gather for you. Let me know if you think this is enough to file a bug report with... I changed the example to make it clearer that it's the subprocess ignoring the signal that is the problem, not anything in Python that is taking time to notice the death of a child process.
That is an important observation; signals that are blocked in the parent process will be blocked in the child process as well.
I'm not sure what to do about this: We apparently *want* the signals blocked in the thread, but we don't want them to be blocked in the process invoked through system(). Proposals are welcome.
Does pthread_atfork() help? Cheers, mwh -- We've had a lot of problems going from glibc 2.0 to glibc 2.1. People claim binary compatibility. Except for functions they don't like. -- Peter Van Eynde, comp.lang.lisp

Michael Hudson <mwh@python.net> writes:
I'm not sure what to do about this: We apparently *want* the signals blocked in the thread, but we don't want them to be blocked in the process invoked through system(). Proposals are welcome.
Does pthread_atfork() help?
Most likely. system(3) is specified as being implemented through fork()/exec(), so an atfork handler should be invoked in any compliant implementation. We could install a child handler, which unblocks the signals we don't want to be blocked. Now, the question is: which signals precisely do we not want to be blocked? I think the answer is "All signals that have not explicitly been blocked by the application". OTOH, we already have PyOS_AfterFork, which could be used instead of pthread_atfork. Jeff, would you like to add some code there, to reset to the default every signal handler for which Handlers records that default handling should occur? Regards, Martin
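The idea of a child-side fork handler that unblocks signals can be sketched at the Python level with os.register_at_fork (Python 3.7 and later), which is a different mechanism from the C-level pthread_atfork discussed here. This is only an illustration under simplifying assumptions: it clears the whole mask instead of tracking which signals the application blocked on purpose, and the hook runs for os.fork() called from Python, not for forks performed inside C library routines. The function names are hypothetical.

```python
import os
import signal


def unblock_signals_in_child():
    """At-fork hook for the child: start with an empty signal mask."""
    signal.pthread_sigmask(signal.SIG_SETMASK, [])


def install_child_hook():
    """Register the hook; Python runs it in the child after each os.fork()."""
    os.register_at_fork(after_in_child=unblock_signals_in_child)


def forked_child_blocks(sig):
    """Fork and report whether the child starts with `sig` blocked."""
    pid = os.fork()
    if pid == 0:
        # Passing an empty list with SIG_BLOCK just queries the current mask.
        blocked = sig in signal.pthread_sigmask(signal.SIG_BLOCK, [])
        os._exit(1 if blocked else 0)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1
```

A possible way to exercise it: block SIGUSR1 in the parent, call forked_child_blocks(signal.SIGUSR1) before and after install_child_hook(), and compare the two answers. Note that a registered hook cannot be removed again, so it stays active for the rest of the process.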

On Sun, Dec 21, 2003 at 11:27:45AM +0100, Martin v. Löwis wrote:
OTOH, we already have PyOS_AfterFork, which could be used instead of pthread_atfork. Jeff, would you like to add some code there, to reset to the default every signal handler for which Handlers records that default handling should occur?
When using pthread_atfork, os.system never triggers my code. However, reimplementing os.system in terms of os.fork+os.execv, it does. I don't know if this is right or wrong according to pthread, but since it doesn't work on my platform the question is academic for me.

Wouldn't the PyOS_AfterFork approach also require python to provide its own versions of any POSIX APIs that would typically be implemented in terms of fork (system(), popen(), and spawn*() come to mind)?

Jeff

Index: Python/thread_pthread.h
===================================================================
RCS file: /cvsroot/python/python/dist/src/Python/thread_pthread.h,v
retrieving revision 2.48
diff -u -r2.48 thread_pthread.h
--- Python/thread_pthread.h	19 Nov 2003 22:52:22 -0000	2.48
+++ Python/thread_pthread.h	21 Dec 2003 23:03:52 -0000
@@ -143,6 +143,17 @@
  * Initialization.
  */
 
+static void PyThread__fork_child(void) {
+	/* Unblock all signals in the child process after fork(). Threads
+	 * other than the main thread run with all signals blocked, and
+	 * the child would otherwise inherit that mask.
+	 */
+	sigset_t childmask;
+	sigfillset(&childmask);
+	SET_THREAD_SIGMASK(SIG_UNBLOCK, &childmask, NULL);
+	fprintf(stderr, "PyThread__fork_child()\n"); fflush(stderr);
+}
+
 #ifdef _HAVE_BSDI
 static
 void _noop(void)
@@ -157,6 +168,7 @@
 	pthread_t thread1;
 	pthread_create(&thread1, NULL, (void *) _noop, &dummy);
 	pthread_join(thread1, NULL);
+	pthread_atfork(NULL, NULL, PyThread__fork_child);
 }
 
 #else /* !_HAVE_BSDI */
@@ -167,6 +179,7 @@
 #if defined(_AIX) && defined(__GNUC__)
 	pthread_init();
 #endif
+	pthread_atfork(NULL, NULL, PyThread__fork_child);
 }
 
 #endif /* !_HAVE_BSDI */

Jeff Epler wrote:
When using pthread_atfork, os.system never triggers my code. However, reimplementing os.system in terms of os.fork+os.execv, it does. I don't know if this is right or wrong according to pthread, but since it doesn't work on my platform the question is academic for me.
Interesting. I'd be curious to find out why this fails - it may be a bug in your system, in which case I'd say "tough luck, complain to the system vendor" (for Redhat 9, I'm tempted to say that anyway ...)

Looking at what likely is the source of your system(3) implementation (glibc 2.3.2, sysdeps/unix/sysv/linux/i386/system.c), I see that the fork used inside system(3) is

# define FORK() \
  INLINE_SYSCALL (clone, 3, CLONE_PARENT_SETTID | SIGCHLD, 0, &pid)

At least, this is the fork being used if __ASSUME_CLONE_THREAD_FLAGS is defined, which is the case for Linux >2.5.50. With this fork() implementation, atfork handlers won't be invoked, which clearly looks like a bug to me.

You might want to upgrade glibc to glibc-2.3.2-27.9.7.i686.rpm and nptl-devel-2.3.2-27.9.7.i686.rpm. In this version, the definition of FORK is changed to

#if defined __ASSUME_CLONE_THREAD_FLAGS && !defined FORK
# define FORK() \
  INLINE_SYSCALL (clone, 3, CLONE_PARENT_SETTID | SIGCHLD, 0, &pid)
#endif

which might actually do the right thing, assuming FORK is already defined to one that calls the atfork handlers.
Wouldn't the PyOS_AfterFork approach also require python to provide its own versions of any POSIX APIs that would typically be implemented in terms of fork (system(), popen(), and spawn*() come to mind)?
You are right. system(3) won't call our version of fork, so PyOS_AfterFork won't be invoked. So forget about this approach. Regards, Martin

On Mon, Dec 22, 2003 at 02:08:34AM +0100, Martin v. Loewis wrote:
With this fork() implementation, atfork handlers won't be invoked, which clearly looks like a bug to me. You might want to upgrade glibc to glibc-2.3.2-27.9.7.i686.rpm and nptl-devel-2.3.2-27.9.7.i686.rpm. In this version, the definition of FORK is changed to
I also tested a fedora machine with glibc-2.3.2-101.1 nptl-devel-2.3.2-101.1 and the pthread_atfork handler is still not called for system(). Opengroup's documentation for system() says [...] The environment of the executed command shall be as if a child process were created using fork(), and the child process invoked the sh utility using execl() [...] http://www.opengroup.org/onlinepubs/007904975/functions/system.html so this looks like a glibc/nptl bug. I filed it in redhat's bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=112517 Given all this, does it make sense to adopt a patch similar to the one I posted earlier, and ignore the bug in system() on any particular OS? system() and popen() are easy to replace in Python if anybody is really bothered by the problem. Jeff

Jeff Epler wrote:
Given all this, does it make sense to adopt a patch similar to the one I posted earlier, and ignore the bug in system() on any particular OS? system() and popen() are easy to replace in Python if anybody is really bothered by the problem.
I'd think so, yes. The patch needs to be elaborated, and I'd delay integration to wait for a Redhat response. We should also restrict this to 2.4 at the moment, to find out how other pthread implementations react to pthread_atfork. Regards, Martin

On Mon, Dec 22, 2003 at 03:46:14AM +0100, Martin v. Loewis wrote:
I'd think so, yes. The patch needs to be elaborated, and I'd delay integration to wait for a Redhat response.
Well, unlike the last two bugs I filed in redhat bugzilla, this one has already seen some activity. I hope they don't take the line that this is not a bug :(

------- Additional Comments From jakub@redhat.com 2003-12-22 03:53 -------
Why? There is no word about system() in
http://www.opengroup.org/onlinepubs/007904975/functions/pthread_atfork.html
and there doesn't seem to be anything in
http://www.opengroup.org/onlinepubs/007904975/functions/system.html
that would mandate system() to be implemented using fork() function.

------- Additional Comments From jepler@unpythonic.net 2003-12-22 08:01 -------
I base my claim that pthread_atfork() should be called by system() from reading
http://www.opengroup.org/onlinepubs/007904975/functions/system.html

The third paragraph under the heading "description" says this: "[CX] The environment of the executed command shall be as if a child process were created using fork(), and the child process invoked the sh utility using execl() as follows [...]"

The rationale section expands on what the "environment of the executed command" is: "IEEE Std 1003.1-2001 places additional restrictions on system(). It requires that if there is a command language interpreter, the environment must be as specified by fork() and exec. This ensures, for example, that close-on-exec works, that file locks are not inherited, and that the process ID is different."

The pthread_atfork web page says it "provides multi-threaded application programs with a standard mechanism for protecting themselves from fork() calls in a library routine or the application itself." I believe that system() is one such library routine, since it is specified to create an environment "as specified by fork()".

This issue arose in the context of Python. For various reasons, signals are blocked in all threads besides the main thread. This means that in processes generated by fork() from subthreads, all signals are blocked, which leads to undesired behavior.

Using the child argument of pthread_atfork() allows signals to be unblocked when using fork()+exec() to execute an external program, but not when system() is used.

On Mon, Dec 22, 2003 at 08:03:17AM -0600, Jeff Epler wrote:
Well, unlike the last two bugs I filed in redhat bugzilla, this one has already seen some activity. I hope they don't take the line that this is not a bug :(
Unfortunately, this is redhat's position.

------- Additional Comments From roland@redhat.com 2003-12-22 16:37 -------
I think it is clear that the specification refers to the elements of the child process state that survive exec, so that the executed command can perceive them as part of its "environment". You could submit an interpretation request, but I think the committee would concur with my reading.

The specification of pthread_atfork refers to calls to fork, not to other parts of the POSIX.1 implementation. If your application calls system, and not fork, those clauses do not apply to it.

roland@redhat.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|                            |NOTABUG
             Status|ASSIGNED                    |CLOSED

Unfortunately, this is redhat's position.
------- Additional Comments From roland@redhat.com 2003-12-22 16:37 ------- I think it is clear that the specification refers to the elements of the child process state that survive exec, so that the executed command can perceive them as part of its "environment". You could submit an interpretation request, but I think the committee would concur with my reading. The specification of pthread_atfork refers to calls to fork, not to other parts of the POSIX.1 implementation. If your application calls system, and not fork, those clauses do not apply to it.
How hard would it be to reimplement our own system() and popen() using only POSIX calls, for POSIX systems? I've always thought of these to be pretty simple combinations of fork() and exec(), with an assumption of a working /bin/sh. Without error checking:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int system(const char *cmd)
{
    pid_t pid = fork();
    if (!pid) {
        /* child: argv[0] is "sh", then "-c" and the command */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);  /* only reached if execl() failed */
    } else {
        /* parent: wait for the child and return its wait status */
        int sts;
        waitpid(pid, &sts, 0);
        return sts;
    }
}

--Guido van Rossum (home page: http://www.python.org/~guido/)

On Mon, Dec 22, 2003 at 02:23:11PM -0800, Guido van Rossum wrote:
How hard would it be to reimplement our own system() and popen() using only POSIX calls, for POSIX systems? I've always thought of these to be pretty simple combinations of fork() and exec(), with an assumption of a working /bin/sh. Without error checking:
Yeah -- this is something we could do if necessary. The opengroup standard lists a few more things to "get right" in terms of signal handling for system(), but we can do all those in C or in Python. Jeff

Guido van Rossum <guido@python.org> writes:
Unfortunately, this is redhat's position.
------- Additional Comments From roland@redhat.com 2003-12-22 16:37 ------- I think it is clear that the specification refers to the elements of the child process state that survive exec, so that the executed command can perceive them as part of its "environment". You could submit an interpretation request, but I think the committee would concur with my reading. The specification of pthread_atfork refers to calls to fork, not to other parts of the POSIX.1 implementation. If your application calls system, and not fork, those clauses do not apply to it.
How hard would it be to reimplement our own system() and popen() using only POSIX calls, for POSIX systems? I've always thought of these to be pretty simple combinations of fork() and exec(), with an assumption of a working /bin/sh. Without error checking:
I think it's a bit harder than what you post, but there's code in APUE for it... Cheers, mwh -- I'd certainly be shocked to discover a consensus. ;-) -- Aahz, comp.lang.python

Guido van Rossum wrote:
How hard would it be to reimplement our own system() and popen() using only POSIX calls, for POSIX systems? I've always thought of these to be pretty simple combinations of fork() and exec(), with an assumption of a working /bin/sh.
I would be concerned that we bypass magic that the system vendor put into system(), which is essential for proper operation. For example, on Linux, system() blocks SIGINT in the parent process while the child is running. Also, the shell executable that system() uses may not be /bin/sh. OTOH, using the same underlying implementation on all systems makes Python behave more predictably. In the specific case, we would not even need pthread_atfork anymore, as we now could invoke PyOS_AfterFork in the child ourselves. Regards, Martin

On Tue, Dec 23, 2003 at 10:01:58AM +0100, Martin v. Loewis wrote:
I would be concerned that we bypass magic that the system vendor put into system(), which is essential for proper operation. For example, on Linux, system() blocks SIGINT in the parent process while the child is running. Also, the shell executable that system() uses may not be /bin/sh.
This behavior (blocking SIGINT and SIGQUIT in the parent) is part of the specification of system(): The system() function shall ignore the SIGINT and SIGQUIT signals, and shall block the SIGCHLD signal, while waiting for the command to terminate. If this might cause the application to miss a signal that would have killed it, then the application should examine the return value from system() and take whatever action is appropriate to the application if the command terminated due to receipt of a signal. Jeff
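A rough Python 3 sketch of a replacement system() that follows the quoted requirements: the parent ignores SIGINT and SIGQUIT and blocks SIGCHLD while it waits, and the child undoes those settings before running the shell. It assumes a working /bin/sh, as in Guido's outline, and it has to be called from the main thread because signal.signal() is restricted to it in Python, so it does not by itself solve the subthread case. system_sketch is a hypothetical name, and popen() is not covered.

```python
import os
import signal


def system_sketch(command):
    """Run `command` with /bin/sh -c and return the raw wait status.

    Must be called from the main thread: signal.signal() only works there.
    """
    # Parent: ignore SIGINT and SIGQUIT, block SIGCHLD while waiting.
    old_int = signal.signal(signal.SIGINT, signal.SIG_IGN)
    old_quit = signal.signal(signal.SIGQUIT, signal.SIG_IGN)
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGCHLD])
    try:
        pid = os.fork()
        if pid == 0:
            try:
                # Child: undo the parent's temporary settings before exec.
                # Signals that were ignored stay ignored; everything else
                # goes back to the default action.
                for sig, old in ((signal.SIGINT, old_int),
                                 (signal.SIGQUIT, old_quit)):
                    keep_ignored = old == signal.SIG_IGN
                    signal.signal(
                        sig, signal.SIG_IGN if keep_ignored else signal.SIG_DFL)
                signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
                os.execv("/bin/sh", ["sh", "-c", command])
            finally:
                os._exit(127)  # only reached if something above failed
        _, status = os.waitpid(pid, 0)
        return status
    finally:
        # Parent: restore the previous dispositions and mask.
        # signal.signal() returns None for handlers that were not installed
        # from Python; those cannot be restored here and stay ignored.
        if old_int is not None:
            signal.signal(signal.SIGINT, old_int)
        if old_quit is not None:
            signal.signal(signal.SIGQUIT, old_quit)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

For example, os.WEXITSTATUS(system_sketch("exit 3")) should give back the shell's exit code. As the quoted text says, a caller that cares about signals it may have missed while waiting should inspect the returned status.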

How hard would it be to reimplement our own system() and popen() using only POSIX calls, for POSIX systems? I've always thought of these to be pretty simple combinations of fork() and exec(), with an assumption of a working /bin/sh.
I would be concerned that we bypass magic that the system vendor put into system(), which is essential for proper operation. For example, on Linux, system() blocks SIGINT in the parent process while the child is running. Also, the shell executable that system() uses may not be /bin/sh.
In practice, I think we can do as well as vendors -- there really isn't that much to it. Systems where /bin/sh doesn't exist will have other problems... And we get to do it right when called from a thread.
OTOH, using the same underlying implementation on all systems makes Python behave more predictably.
In the specific case, we would not even need pthread_atfork anymore, as we now could invoke PyOS_AfterFork in the child ourselves.
Right! --Guido van Rossum (home page: http://www.python.org/~guido/)

Jeff Epler wrote:
You could submit an interpretation request, but I think the committee would concur with my reading.
Hmm. This is what I now did: I submitted an interpretation request to PASC:

Edition of Specification (Year): 2003
Defect code : 3. Clarification required

It is unclear whether calling system(3) invokes atfork handlers in a conforming implementation. system(3) specifies

"The environment of the executed command shall be as if a child process were created using fork(), and the child process invoked the sh utility using execl() as follows:"

In particular, usage of the word "environment" is confusing here. It may refer just to environment variables; however, the "Application usage" section indicates that signal handlers should also be arranged as if the process was created through fork() and execl(). This still does not make clear whether handlers installed through pthread_atfork() are invoked.

Action:

The description of system() should change to

"system() behaves as if a new process was created using fork(), and the child process invoked the sh utility using execl() ..."

In addition, the application usage section should make it clear that atfork handlers are invoked.

Regards, Martin

Jeff Epler <jepler@unpythonic.net> writes:
On Wed, Dec 17, 2003 at 03:01:00PM +0100, Michael Lauer wrote:
Hi, I have a python program which works fine on x86 but doesn't work on any of my arm-linux devices (Zaurus, Ipaq, SIMpad) - all of them are running Python 2.3.2 on top of glibc 2.3.2+linuxthreads on top of kernel 2.4.18-rmk6-pxa3 or kernel 2.4.19-rmk7, respectively.
Using threads and fork together seems to be a big smelly armpit in Python. There are also problems on redhat9, where signals in a fork+exec'd subprocess are blocked, for instance.
This is in no way restricted to RH9...
This seemed to be a consequence of blocking all signals in thread_pthread.h:PyThread_start_new_thread(). Perhaps pthread_atfork()
Perhaps. Cheers, mwh -- I have no disaster recovery plan for black holes, I'm afraid. Also please be aware that if it one looks imminent I will be out rioting and setting fire to McDonalds (always wanted to do that) and probably not reading email anyway. -- Dan Barlow

On Wed, Dec 17, 2003 at 03:01:00PM +0100, Michael Lauer wrote:
This program - as is - just hangs in the first start_new_thread() and never comes back. If you comment out the first call to createPipe(), then it works fine.
I can't reproduce this problem on my Debian/ARM machine. I'll try it on the iPAQ under Familiar later. My guess is that your glibc is too old. p.
participants (7)
- Guido van Rossum
- Jeff Epler
- Martin v. Loewis
- martin@v.loewis.de
- Michael Hudson
- Michael Lauer
- Phil Blundell