[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals
seanharr11 at gmail.com
Fri Oct 19 07:58:27 EDT 2018
On Fri, Oct 19, 2018 at 7:32 AM Joni Orponen <j.orponen at 4teamwork.ch> wrote:
> On Fri, Oct 19, 2018 at 9:09 AM Thomas Moreau <
> thomas.moreau.2010 at gmail.com> wrote:
>> I have been working on the concurrent.futures module lately and I think
>> this optimization should be avoided in the context of python Pools.
>> This is an interesting idea, however its implementation will bring many
>> complicated issues as it breaks the basic paradigm of a Pool: the tasks are
>> independent and you don't know which worker is going to run which task.
>> The function is serialized with each task because of this paradigm. This
>> ensures that any worker picking the task will be able to perform it
>> independently from the tasks it has run before, given that it has been
>> initialized correctly at the beginning. This makes it simple to run each
> I would not mind if there would be a subtype of Pool where you can only
> apply one kind of task to. This is a very common use mode.
> Though the question there is 'should this live in Python itself'? I'd be
> fine with a package on PyPI.
Thomas makes a good point: despite the common user mode of calling
Pool.map() once, blocking, and returning, the need for serialization of
functions within tasks arises when calling Pool.map() (and friends) while
workers are still running (i.e. there was a previous call to
However, this is an uncommon user mode, as Joni points out. The most common
user mode is that your Pool workers are only ever executing one type of
task at a given time. This opens up optimization opportunities, so long as
we store state on the subclassed Pool object (SingleTaskPool?) recording
whether a single task is running or has completed, to prevent the user
from getting into the funky state above.
I would rather see this included in the multiprocessing stdlib, as it
seemingly will not introduce a lot of new code and would benefit from existing
tests. Most importantly, it optimizes (in my opinion) the most common user
mode of Pool.
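For illustration, here is a minimal sketch of what such a single-task pool could look like. Only the name SingleTaskPool comes from the parenthetical above; the wrapper class, the helpers _install_task and _run_task, and the example function lookup are all hypothetical and not an existing multiprocessing API. The sketch assumes the function can be handed to each worker through Pool's existing initializer/initargs parameters, so that map() only has to send a small trampoline reference plus the items.

```python
import functools
import multiprocessing

_task = None  # per-worker slot, filled once by _install_task


def _install_task(func):
    """Pool initializer: store the pool's single task function in this worker."""
    global _task
    _task = func


def _run_task(item):
    """Trampoline submitted for every item; pickled by name, so it stays small."""
    return _task(item)


class SingleTaskPool:
    """Hypothetical wrapper, not a stdlib class: one function per pool."""

    def __init__(self, func, processes=None, context=None):
        ctx = context or multiprocessing.get_context()
        # func (and any data bound to it) travels through initargs, so it is
        # delivered once per worker rather than with every chunk of tasks.
        self._pool = ctx.Pool(processes, initializer=_install_task,
                              initargs=(func,))

    def map(self, iterable, chunksize=None):
        # Only the trampoline reference and the items are sent to the workers.
        return self._pool.map(_run_task, iterable, chunksize)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._pool.terminate()


def lookup(table, key):
    """Example task whose first argument is the data we want to send once."""
    return table[key]


if __name__ == "__main__":
    squares = {i: i * i for i in range(1000)}
    with SingleTaskPool(functools.partial(lookup, squares), processes=2) as pool:
        print(pool.map([3, 4, 5]))  # [9, 16, 25]
```

Because the function is installed by the initializer, a worker that Pool starts as a replacement would receive it too. The cost is that the pool is tied to one function for its whole lifetime.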
>> As the Pool comes with no scheduler, with your idea, you would need a
>> synchronization step to send the function to all workers before you can
>> launch your task. But if there is already one worker performing a long
>> running task, does the Pool wait for it to be done before it sends the
>> function? If the Pool doesn't wait, how does it ensure that this worker
>> will be able to get the definition of the function before running it?
>> Also, the multiprocessing.Pool has some features where a worker can shut
>> itself down after a given number of tasks or a timeout. How does it ensure
>> that the new worker will have the definition of the function?
>> It is unsafe to try such a feature (sending only once an object) anywhere
>> else than in the initializer which is guaranteed to be run once per worker.
>> On the other hand, you mentioned an interesting point being that making
>> globals available in the workers could be made simpler. A possible solution
>> would be to add a "globals" argument in the Pool which would instantiate
>> global variables in the workers. I have no specific idea on the
>> implementation of such features but it would be safer as it would be an
>> initialization feature.
> Would this also mean one could use a Pool in a context where threading is
> used? Currently using threading side effects unpicklables into the globals.
> Also being able to pass in globals=None would be optimal for a lot of use
We could do this - but we can easily get the same behavior by declaring a
"global" in "initializer" (albeit a pattern which I do not like). I like
the idea to extend the Pool class a bit more, but this is also my opinion.
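For reference, the "global in initializer" pattern mentioned above can be sketched roughly as follows. The names init_worker, read_shared, run and _shared are hypothetical; only Pool's initializer, initargs and maxtasksperchild parameters are real API. maxtasksperchild=1 is set purely to show that replacement workers rerun the initializer, which is the once-per-worker guarantee Thomas refers to.

```python
import multiprocessing

_shared = None  # populated in each worker process by init_worker


def init_worker(data):
    """Runs once in every worker process, including replacement workers."""
    global _shared
    _shared = data


def read_shared(key):
    """Task function: reads the per-worker global instead of receiving the data."""
    return _shared[key]


def run(keys, data, context=None):
    ctx = context or multiprocessing.get_context()
    # maxtasksperchild=1 retires a worker after each task (each chunk counts as
    # one task); the workers that replace it are initialized the same way.
    with ctx.Pool(2, initializer=init_worker, initargs=(data,),
                  maxtasksperchild=1) as pool:
        return pool.map(read_shared, keys)


if __name__ == "__main__":
    print(run(["a", "c", "b", "a"], {"a": 1, "b": 2, "c": 3}))  # [1, 3, 2, 1]
```

The data is passed once per worker through initargs, but the task function has to reach into module state to find it, which is the part of the pattern that feels awkward.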
> Joni Orponen