[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

Ask Solem report at bugs.python.org
Wed Jul 14 09:58:16 CEST 2010

Ask Solem <askh at opera.com> added the comment:

There's one more thing:

    if exitcode is not None:
        cleaned = True
        if exitcode != 0 and not worker._termination_requested:
            abnormal.append((worker.pid, exitcode))

Instead of restarting crashed worker processes, this will simply bring
down the pool, right?

If so, then I think it's important to decide whether we want to keep
the supervisor functionality, and if so decide on a recovery strategy.

Some alternatives are:

A) Any missing worker brings down the pool.

B) Missing workers will be replaced one by one. A maximum restart frequency decides when the supervisor should give up trying to recover the pool and crash it.

C) Same as B, except that any process crashing when trying to get() will bring down the pool.
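
To make the restart limit in B concrete, here is a minimal sketch of a sliding-window restart budget. The class name RestartBudget, its parameters, and the injectable clock are all hypothetical; nothing like this exists in the patch or in multiprocessing.pool. It is only meant to show the kind of check the supervisor could make before replacing a worker.

```python
import time
from collections import deque


class RestartBudget:
    """Hypothetical helper: allow at most max_restarts per sliding window."""

    def __init__(self, max_restarts, window_seconds, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._stamps = deque()       # times of recent restarts, oldest first

    def allow_restart(self):
        """Return True and record a restart, or False if the budget is spent."""
        now = self._clock()
        # Forget restarts that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False             # too many recent crashes: give up
        self._stamps.append(now)
        return True
```

The supervisor would call allow_restart() each time it finds a dead worker, replace the worker on True, and bring down the pool on False.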

I think the supervisor is a good addition, so I would very much like to keep it. It's also a step closer to my goal of bringing the enhancements made in Celery to multiprocessing.pool.

C is only a few changes away from this patch, but B would also be possible in combination with my accept_callback patch. That callback does add some overhead, so it depends on the level of recovery we want to support.

accept_callback: this is a callback that is triggered when the job is reserved by a worker process. The acks are sent to an additional Queue, with an additional thread processing the acks (hence the mentioned overhead). This enables us to keep track of what the worker processes are doing, and also to get the PID of the worker processing any given job. Besides recovery, potential uses are monitoring and the ability to terminate a job (ApplyResult.terminate?). See http://github.com/ask/celery/blob/master/celery/concurrency/processes/pool.py
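
The ack flow can be sketched roughly as follows. This is not the code from the accept_callback patch or from Celery: threads stand in for worker processes, a small integer worker_id stands in for the PID, and the names worker, ack_handler and run_jobs are made up for illustration.

```python
import queue
import threading


def worker(worker_id, task_queue, ack_queue, result_queue):
    """Stand-in for a pool worker (a thread here, a process in the pool)."""
    while True:
        item = task_queue.get()
        if item is None:                      # sentinel: shut down
            break
        job_id, func, args = item
        ack_queue.put((job_id, worker_id))    # ack as soon as the job is reserved
        result_queue.put((job_id, func(*args)))


def ack_handler(ack_queue, accepted, callbacks):
    """The additional thread that processes acks (the extra overhead)."""
    while True:
        ack = ack_queue.get()
        if ack is None:                       # sentinel: no more acks
            break
        job_id, worker_id = ack
        accepted[job_id] = worker_id          # which worker has which job
        callback = callbacks.get(job_id)
        if callback is not None:
            callback(job_id, worker_id)


def run_jobs(jobs, num_workers=2):
    """Run (func, args, accept_callback) triples; return (results, accepted)."""
    task_queue = queue.Queue()
    ack_queue = queue.Queue()
    result_queue = queue.Queue()
    accepted = {}
    callbacks = {job_id: job[2] for job_id, job in enumerate(jobs)}

    acker = threading.Thread(target=ack_handler,
                             args=(ack_queue, accepted, callbacks))
    acker.start()
    workers = [threading.Thread(target=worker,
                                args=(i, task_queue, ack_queue, result_queue))
               for i in range(num_workers)]
    for w in workers:
        w.start()

    for job_id, (func, args, _) in enumerate(jobs):
        task_queue.put((job_id, func, args))
    for _ in workers:
        task_queue.put(None)                  # one sentinel per worker
    for w in workers:
        w.join()
    ack_queue.put(None)                       # all acks are queued by now
    acker.join()

    results = {}
    while not result_queue.empty():
        job_id, value = result_queue.get()
        results[job_id] = value
    return results, accepted
```

With the accepted mapping in hand, a supervisor that finds a dead worker could look up which job that worker had reserved, which is what recovery strategy B would need.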


Python tracker <report at bugs.python.org>
