[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

Mon Jul 12 22:48:07 CEST 2010

Greg Brockman <gdb at ksplice.com> added the comment:

Thanks much for taking a look at this!

> why are you terminating the second pass after finding a failed 
> process?
Unfortunately, if you've lost a worker, you are no longer guaranteed that the cache will eventually be empty.  In particular, you may have lost a task, which could result in an ApplyResult waiting forever for a _set call.
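
To make that failure mode concrete, here is a toy model (not the actual multiprocessing code).  ToyApplyResult and its cache dict are hypothetical stand-ins for ApplyResult and the pool's cache; the only assumption is that a result removes itself from the cache when _set is called.

```python
import threading


class ToyApplyResult:
    """Hypothetical stand-in for ApplyResult: wait() blocks until _set()."""

    def __init__(self, cache, job):
        self._event = threading.Event()
        self._cache = cache
        self._job = job
        cache[job] = self  # registered as pending until _set() is called

    def _set(self, value):
        self._value = value
        self._event.set()
        del self._cache[self._job]  # completed results leave the cache

    def wait(self, timeout=None):
        # Returns True once _set() has run, False if the timeout expires.
        return self._event.wait(timeout)


cache = {}
done = ToyApplyResult(cache, job=0)
lost = ToyApplyResult(cache, job=1)

done._set("ok")  # a healthy worker reported back for job 0
# The worker holding job 1 died, so nothing will ever call lost._set().

finished = lost.wait(timeout=0.1)  # without a timeout this would block forever
pending_jobs = sorted(cache)       # job 1 never leaves the cache
```

A loop that waits for the cache to drain would never exit in this situation, which is why the second pass gives up once it finds a failed process.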

More generally, my chief assumption here is that the unexpected death of a worker process is unrecoverable.  It would be nice to have a better workaround than just aborting everything, but I couldn't see a way to do that.

> Unpickleable errors and other errors occurring in the worker body are
> not exceptional cases, at least not now that the pool is supervised
> by _handle_workers.
I could be wrong, but that's not what my experiments were indicating.  In particular, if an unpickleable error occurs, then a task has been lost, which means that the relevant map, apply, etc. will wait forever for completion of the lost task.

> I think the result should be set also in this case, so the user can
> inspect the exception after the fact.
That does sound useful.  But how can you determine the job (and the value of i) if it's an unpickleable error?  It would be nice to be able to retrieve job/i without having to unpickle the rest.
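
One way this could work, as a rough sketch only: pickle (job, i) as its own small header and append the payload's pickle after it, so the header can be read even when the payload cannot.  encode_result and decode_result are hypothetical names, and this is not how the pool currently frames its messages.

```python
import io
import pickle


def encode_result(job, i, payload_bytes):
    # Hypothetical framing: the (job, i) header is pickled on its own,
    # and the already-pickled payload is appended after it.
    return pickle.dumps((job, i)) + payload_bytes


def decode_result(data):
    stream = io.BytesIO(data)
    job, i = pickle.load(stream)  # consumes only the header pickle
    try:
        return job, i, True, pickle.load(stream)
    except Exception as exc:
        # The payload is unreadable, but job and i are still known,
        # so the error can be attached to the right result.
        return job, i, False, exc


payload = pickle.dumps({"answer": 42})
good_decoded = decode_result(encode_result(3, 7, payload))
# Truncate the payload to simulate data that fails to unpickle.
bad_decoded = decode_result(encode_result(3, 8, payload[:-3]))
```

With something like this, the result handler could still call _set on the right entry with the exception, rather than losing the task.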

> For shutdown.patch, I thought this only happened in the worker 
> handler, but you've enabled this for the result handler too? I don't 
> care about the worker handler, but with the result handler I'm 
> worried that I don't know what ignoring these exceptions actually 
> means.
You have a good point.  I didn't think about the patch very hard.  I've only seen these exceptions from the worker handler, but AFAICT there's no guarantee that bad luck with the scheduler wouldn't result in the same problem in the result handler.  One option would be to narrow the breadth of the exceptions caught by _make_shutdown_safe (do we need to catch anything but TypeErrors?).  Another option would be to enable only for the worker handler.  I don't have a particularly great sense of what the Right Thing to do here is.
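
For the first option, a narrowed wrapper might look roughly like the sketch below.  This is not the code in shutdown.patch: make_shutdown_safe and its exceptions parameter are hypothetical, and the example assumes the TypeErrors in question come from calling a name that has already been set to None during interpreter teardown.

```python
import functools


def make_shutdown_safe(func, exceptions=(TypeError,)):
    """Hypothetical helper: swallow only the listed exception types."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except exceptions:
            # Assumed to be shutdown noise; anything else propagates.
            return None

    return wrapper


def handler_hitting_cleared_global():
    debug = None  # simulates a module global cleared at interpreter exit
    debug("worker handler exiting")  # raises TypeError


def handler_with_real_bug():
    raise ValueError("genuine failure")


safe_noise = make_shutdown_safe(handler_hitting_cleared_global)
safe_bug = make_shutdown_safe(handler_with_real_bug)
```

Here safe_noise() returns quietly while safe_bug() still raises, so a genuine failure in the result handler would not be silently ignored.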

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9205>
_______________________________________