I agree that blocking shutdown by default isn't a good idea. A child will eventually get indefinitely stuck on a nonresponsive connection and hang the whole server. This behavior change is surprising and should be reverted in master, and definitely not backported.

As for block-timeout or block-timeout-kill, waiting more than zero seconds in server_close() should be optional, because you're right that the best timeout is circumstantial. Since ThreadingMixIn also leaks threads, server_close() could grow a timeout flag (following the socket module timeout convention) and maybe a terminate boolean. ThreadingMixIn could then also be fixed. I'm not sure how useful that is though, since I'd bet almost all users of socketserver exit the process shortly after server_close(). Plus it can't be backported to the feature-freeze branches.

It seems like this is getting complicated enough that putting the fix in test_socketserver.py is probably best. Another solution is to add a secret terminating-close flag to ForkingMixIn just for the tests. Is that good practice in the stdlib?

On Fri, Aug 11, 2017 at 6:46 AM Victor Stinner <victor.stinner@gmail.com> wrote:
Hi,

I'm working on reducing the failure rate of Python CIs (Travis CI,
AppVeyor, buildbots). For that, I'm trying to reduce test side effects
using "environment altered" warnings. This week, I worked on
support.reap_children() which detects leaked child processes (usually
created with os.fork()).

I found a bug in the socketserver module: it waits for child processes
completion, but only in non-blocking mode. If a child process takes
too long, the server will never reads its exit status and so the
server leaks "zombie processes". Leaking processes can increase the
memory usage, spawning new processes can fail, etc.

=> http://bugs.python.org/issue31151

I changed the code to call waitpid() in blocking mode on each child
process on server_close(), to ensure that all children completed when
on server close:

https://github.com/python/cpython/commit/aa8ec34ad52bb3b274ce91169e1bc4a598655049

After pushing my change, I'm not sure anymore if it's a good idea.
There is a risk that server_close() blocks if a child is stuck on a
socket recv() or send() for some reasons.

Should we relax the code by waiting a few seconds (problem: hardcoded
timeouts are always a bad idea), or *terminate* processes (SIGKILL on
UNIX) if they don't complete fast enough?

I don't know which applications use socketserver. How I can test if it
breaks code in the wild?

At least, I didn't notice any regression on Python CIs.

Well, maybe the change is ok for the master branch. But I would like
your opinion because now I would like to backport the fix to 2.7 and
3.6 branches. It might break some applications.

If we *cannot* backport such change to 2.7 and 3.6 because it changes
the behaviour, I will fix the bug in test_socketserver.py instead.

What do you think?

Victor
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/rmsr%40lab.net