[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Michael Selik michael.selik at gmail.com
Tue Oct 16 09:27:32 EDT 2018


Would this change the behavior of the other Pool methods in some way if the
user, for whatever reason, mixed techniques?

imap_unordered will only block when nexting the generator. If the user
mingles nexting that generator with, say, apply_async, could the change
you're proposing have some side-effect?
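
To make the scenario concrete, here is a minimal sketch of the kind of mixing
I mean. It uses multiprocessing.pool.ThreadPool only so that the snippet runs
without any start-method concerns; it exposes the same methods as Pool. The
names `square` and `mixed_usage` are made up for illustration, and this shows
the behavior as it is today, not the proposed change.

```python
from multiprocessing.pool import ThreadPool  # same method names as multiprocessing.Pool


def square(x):
    return x * x


def mixed_usage(items):
    """Interleave next() on an imap_unordered iterator with an apply_async call."""
    with ThreadPool(2) as pool:
        lazy = pool.imap_unordered(square, items)  # returns immediately
        first = next(lazy)                         # blocks until some result is ready
        extra = pool.apply_async(square, (10,))    # unrelated task submitted mid-iteration
        rest = list(lazy)                          # drain the remaining map results
        return sorted([first] + rest), extra.get(timeout=5)
```

Whatever per-worker state the optimization adds would have to survive that
apply_async call arriving between two results of the same imap_unordered job.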

On Tue, Oct 16, 2018, 5:09 AM Sean Harrington <seanharr11 at gmail.com> wrote:

> @Nathaniel this is what I am suggesting as well. No caching - just storing
> the `fn` on each worker, rather than pickling it for each item in our
> iterable.
>
> As long as we store the `fn` post-fork on the worker process (perhaps as
> a global), subsequent calls to Pool.map shouldn't be affected (referencing
> Antoine's & Michael's points that "multiprocessing encapsulates each
> subprocess's globals in a separate namespace").
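
A rough, single-process sketch of "store the fn on the worker" might look like
the following. Everything here is invented for illustration: `WorkerState`,
the message kinds "set_fn" and "item", and the functions `double` and
`negate`. It is not how multiprocessing.pool is implemented, only a model of
the idea that the function is unpickled once and then reused for each item.

```python
import pickle


def double(x):
    return 2 * x


def negate(x):
    return -x


class WorkerState:
    """Hypothetical stand-in for the state a worker process would keep."""

    def __init__(self):
        self.fn = None

    def handle(self, message):
        # Each message is a pickled (kind, payload) pair, as if read from a pipe.
        kind, payload = pickle.loads(message)
        if kind == "set_fn":
            self.fn = payload          # stored once, replaces any earlier fn
            return None
        if kind == "item":
            if self.fn is None:
                raise RuntimeError("no function has been sent to this worker")
            return self.fn(payload)
        raise ValueError(f"unknown message kind: {kind!r}")


def run_map(worker, fn, items):
    """Send fn once, then send only the items."""
    worker.handle(pickle.dumps(("set_fn", fn)))
    return [worker.handle(pickle.dumps(("item", x))) for x in items]
```

Because each call to `run_map` starts by replacing the stored function, a
second map with a different function does not see the first one.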
>
> @Antoine - I'm making an effort to take everything you've said into
> consideration here.  My initial PR and talk
> <https://www.youtube.com/watch?v=DH0JVSXvxu0> was intended to shed light
> on a couple of pitfalls that I often see Python end-users encounter with
> Pool. Moving beyond my naive first attempt, and the onslaught of deserved
> criticism, it seems that we have an opportunity here: No changes to the
> interface, just an optimization to reduce the frequency of pickling.
>
> Raymond Hettinger may also be interested in this optimization, as he
> speaks (with great analogies) about different ways you can misuse
> concurrency in Python <https://www.youtube.com/watch?v=9zinZmE3Ogk>. This
> would address one of the pitfalls that he outlines: the "size of the
> serialized/deserialized data".
>
> Is this an optimization that either of you would be willing to review, and
> accept, if I find there is a *reasonable way* to implement it?
>
>
> On Fri, Oct 12, 2018 at 3:40 PM Nathaniel Smith <njs at pobox.com> wrote:
>
>> On Fri, Oct 12, 2018, 06:09 Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>>> On Fri, 12 Oct 2018 08:33:32 -0400
>>> Sean Harrington <seanharr11 at gmail.com> wrote:
>>> > Hi Nathaniel - if this solution can be made performant, then I would
>>> > be more than satisfied.
>>> >
>>> > I think this would require removing "func" from the "task tuple", and
>>> > storing the "func" "once per worker" somewhere globally (maybe a class
>>> > attribute set post-fork?).
>>> >
>>> > This also has the beneficial outcome of increasing general performance of
>>> > Pool.map and friends. I've seen MANY folks across the interwebs doing
>>> > things like passing instance methods to map, resulting in "big" tasks, and
>>> > slower-than-sequential parallelized code. Parallelizing "instance methods"
>>> > by passing them to map, w/o needing to wrangle with staticmethods and
>>> > globals, would be a GREAT feature! It'd just be as easy as:
>>> >
>>> >     Pool.map(self.func, ls)
>>> >
>>> > What do you think about this idea? This is something I'd be able to take
>>> > on, assuming I get a few core dev blessings...
>>>
>>> Well, I'm not sure how it would work, so it's difficult to give an
>>> opinion.  How do you plan to avoid passing "self"?  By caching (by
>>> equality? by identity?)?  Something else?  But what happens if "self"
>>> changed value (in the case of a mutable object) in the parent?  Do you
>>> keep using the stale version in the child?  That would break
>>> compatibility...
>>>
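
The staleness concern can be sketched with pickle alone, without a real Pool.
`Scaler` is a hypothetical class made up for this example. Pickling a bound
method also pickles `self`, so a copy kept on a worker holds whatever state
`self` had at the moment it was serialized.

```python
import pickle


class Scaler:
    """Hypothetical mutable object whose method is handed to a pool."""

    def __init__(self, factor):
        self.factor = factor

    def scale(self, x):
        return x * self.factor


parent = Scaler(2)

# What a worker would receive if the bound method were sent now and kept.
cached_bytes = pickle.dumps(parent.scale)

# The parent mutates the object after the method was shipped.
parent.factor = 10

stale = pickle.loads(cached_bytes)                 # still sees factor == 2
fresh = pickle.loads(pickle.dumps(parent.scale))   # re-pickled, sees factor == 10

stale_result = stale(3)
fresh_result = fresh(3)
```

Re-sending the function at the start of every map call, as opposed to caching
across calls, keeps the child in step with the parent at call time.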
>>
>> I was just suggesting that within a single call to Pool.map, it would be a
>> reasonable optimization to only send the fn once to each worker. So e.g. if
>> you have 5 workers and 1000 items, you'd only pickle fn 5 times, rather
>> than 1000 times like we do now. I wouldn't want to get any fancier than
>> that with caching data between different map calls or anything.
>>
>> Of course even this may turn out to be too complicated to implement in a
>> reasonable way, since it would require managing some extra state on the
>> workers. But semantically it would be purely an optimization of current
>> semantics.
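
A back-of-the-envelope sketch of the 5-versus-1000 arithmetic, using pickle
directly instead of a real Pool. `Model`, `predict` and the two helper
functions are hypothetical names. One caveat: as far as I can tell, Pool.map
groups items into chunks and sends the function along with each chunk, so the
real count depends on chunksize; the per-item figure below mirrors the numbers
in the text rather than measuring Pool itself.

```python
import pickle


class Model:
    """Hypothetical object with bulky state, standing in for `self`."""

    pickle_count = 0  # how many times any instance has been serialized

    def __init__(self, n_weights):
        self.weights = list(range(n_weights))

    def predict(self, x):
        return x + len(self.weights)

    def __getstate__(self):
        Model.pickle_count += 1
        return self.__dict__


def bytes_fn_with_every_task(fn, items):
    """Total bytes if (fn, item) is pickled for every task."""
    return sum(len(pickle.dumps((fn, item))) for item in items)


def bytes_fn_once_per_worker(fn, items, n_workers):
    """Total bytes if fn is pickled once per worker and items are sent bare."""
    setup = sum(len(pickle.dumps(fn)) for _ in range(n_workers))
    return setup + sum(len(pickle.dumps(item)) for item in items)


model = Model(n_weights=100)
items = range(1000)

Model.pickle_count = 0
per_task_bytes = bytes_fn_with_every_task(model.predict, items)
per_task_pickles = Model.pickle_count      # one per item

Model.pickle_count = 0
per_worker_bytes = bytes_fn_once_per_worker(model.predict, items, n_workers=5)
per_worker_pickles = Model.pickle_count    # one per worker
```

The gap in bytes grows with the size of whatever the function drags along,
which is why bound methods on large objects are the painful case.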
>>
>> -n
>>
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev at python.org
>> https://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe:
>> https://mail.python.org/mailman/options/python-dev/seanharr11%40gmail.com
>>