Processes not exiting

MRAB python at mrabarnett.plus.com
Mon Aug 10 11:25:20 EDT 2009


ma3mju wrote:
> On 7 Aug, 16:02, MRAB <pyt... at mrabarnett.plus.com> wrote:
>> ma3mju wrote:
>>> On 3 Aug, 09:36, ma3mju <matt.u... at googlemail.com> wrote:
>>>> On 2 Aug, 21:49, Piet van Oostrum <p... at cs.uu.nl> wrote:
>>>>>>>>>> MRAB <pyt... at mrabarnett.plus.com> (M) wrote:
>>>>>> M> I wonder whether one of the workers is raising an exception, perhaps due
>>>>>> M> to lack of memory, when there are large number of jobs to process.
>>>>> But that wouldn't prevent the join. And you would probably get an
>>>>> exception traceback printed.
>>>>> I wonder if something fishy is happening in the multiprocessing
>>>>> infrastructure. Or maybe the Fortran code goes wrong; I suspect it
>>>>> has no protection against buffer overruns and similar problems.
>>>>> --
>>>>> Piet van Oostrum <p... at cs.uu.nl>
>>>>> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
>>>>> Private email: p... at vanoostrum.org
>>>> I don't think it's a memory problem. The reason for the hard and
>>>> easy queues is that the larger examples use far more RAM. If I run
>>>> all of the workers on harder problems I do begin to run out of RAM
>>>> and end up spending all my time swapping in and out, so I limit the
>>>> number of harder problems I run at the same time. I've watched it
>>>> run to the end (a very boring couple of hours): it stays out of my
>>>> swap space and everything appears to stay in RAM. It just hangs
>>>> after "poison" has been printed for each process.
>>>> The other thing is that I get the message "here" telling me each
>>>> process broke out of its loop after seeing the poison pill, and
>>>> everything that was queued is listed as output. Surely, if I were
>>>> running out of memory, I wouldn't expect all of the jobs to be
>>>> listed as output.
>>>> I have a serial script that works fine, so I know the Fortran code
>>>> works for each example individually.
>>>> Thanks
>>>> Matt
>>> Any ideas for a solution?
>> A workaround is to do them in small batches.
>>
>> You could put each job in a queue with a flag to say whether it's hard
>> or easy, then:
>>
>>      while have more jobs:
>>          move up to BATCH_SIZE jobs into worker queues
>>          create and start workers
>>          wait for workers to finish
>>          discard workers
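
That batch loop might be sketched in Python roughly as follows (the
`work` function, `BATCH_SIZE`, and `run_in_batches` names are
placeholders for illustration, not taken from the original code):

```python
import multiprocessing as mp

BATCH_SIZE = 4  # hypothetical cap on simultaneous workers


def work(job):
    # stand-in for the real (e.g. Fortran-backed) computation
    return job * job


def run_in_batches(jobs, batch_size=BATCH_SIZE):
    """Process jobs in small batches so that at most batch_size
    worker processes ever exist at the same time."""
    results = []
    for start in range(0, len(jobs), batch_size):
        batch = jobs[start:start + batch_size]
        with mp.Pool(processes=len(batch)) as pool:
            # map blocks until the whole batch is finished
            results.extend(pool.map(work, batch))
        # leaving the with-block closes and joins the pool,
        # i.e. the workers for this batch are discarded here
    return results


if __name__ == "__main__":
    print(run_in_batches(list(range(10))))
```

Because each batch's pool is joined before the next one starts, only a
bounded number of results is ever buffered at once.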
> 
> Yeah, I was hoping for something with a bit more finesse. In the end
> I used Pool instead, with a callback function, and that has solved
> the problem. I did find this snippet today:
> 
> Joining processes that use queues
> 
>     Bear in mind that a process that has put items in a queue will
> wait before terminating until all the buffered items are fed by the
> “feeder” thread to the underlying pipe. (The child process can call
> the Queue.cancel_join_thread() method of the queue to avoid this
> behaviour.)
> 
>     This means that whenever you use a queue you need to make sure
> that all items which have been put on the queue will eventually be
> removed before the process is joined. Otherwise you cannot be sure
> that processes which have put items on the queue will terminate.
> Remember also that non-daemonic processes will be joined
> automatically.
> 
> 
> I don't know (I'm not a computer scientist), but could it have been
> the pipe getting full?
> 
> In case anyone else is affected by this, I've attached the new code
> to show the changes I made to fix it.
> 
[snip]
Maybe the reason is this:

Threads share an address space, so putting data into a queue simply
involves putting a reference there, but processes don't share an address
space, so a sender must continue to exist until the data itself has been
copied into the pipe that connects the processes. This pipe has a
limited capacity.

In your code you were waiting for the easy workers to terminate while
not reading from the results queue, so nothing was draining the
underlying pipe either; with a large number of jobs the pipe was
becoming full.

In summary: the worker didn't terminate because the pipe was full; the
pipe was full because you weren't reading the results; you weren't
reading the results because the worker hadn't terminated.


