Re: [Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

16 Oct 2018

Would this change the behavior of the other pool methods in some way if the
user, for whatever reason, mixed techniques?

imap_unordered will only block when nexting the generator. If the user
mingles nexting that generator with, say, apply_async, could the change
you're proposing have some side-effect?
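
To make "mixing techniques" concrete, here is a minimal sketch of the kind of interleaving in question. It assumes ThreadPool, which exposes the same imap_unordered and apply_async methods as Pool, purely so the snippet runs without any process start-up concerns; `square` and `mix_techniques` are illustrative names, not anything from the PR.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


def mix_techniques():
    """Interleave imap_unordered with apply_async on one pool."""
    with ThreadPool(2) as pool:
        # imap_unordered returns an iterator right away; nothing blocks yet.
        it = pool.imap_unordered(square, range(4))
        # next() blocks only until some result is available.
        first = next(it)
        # A different technique is submitted while the iterator is half-consumed.
        extra = pool.apply_async(square, (10,))
        # Draining the iterator blocks until the remaining results arrive.
        rest = list(it)
        return sorted([first] + rest), extra.get(timeout=10)
```

Whether a per-worker stored `fn` would survive this kind of interleaving is exactly the open question; the sketch only shows the calling pattern.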

On Tue, Oct 16, 2018, 5:09 AM Sean Harrington  wrote:
...
@Nathaniel this is what I am suggesting as well. No caching - just storing
the `fn` on each worker, rather than pickling it for each item in our
iterable.
As long as we store the `fn` post-fork on the worker process (perhaps as a
global), subsequent calls to Pool.map shouldn't be affected (referencing
Antoine's & Michael's points that "multiprocessing encapsulates each
subprocess's globals in a separate namespace").
@Antoine - I'm making an effort to take everything you've said into
consideration here.  My initial PR and talk
https://www.youtube.com/watch?v=DH0JVSXvxu0 were intended to shed light
on a couple of pitfalls that I often see Python end-users encounter with
Pool. Moving beyond my naive first attempt, and the onslaught of deserved
criticism, it seems that we have an opportunity here: No changes to the
interface, just an optimization to reduce the frequency of pickling.
Raymond Hettinger may also be interested in this optimization, as he
speaks (with great analogies) about different ways you can misuse
concurrency in Python https://www.youtube.com/watch?v=9zinZmE3Ogk. This
would address one of the pitfalls that he outlines: the "size of the
serialized/deserialized data".
Is this an optimization that either of you would be willing to review, and
accept, if I find there is a *reasonable way* to implement it?
On Fri, Oct 12, 2018 at 3:40 PM Nathaniel Smith  wrote:
...
On Fri, Oct 12, 2018, 06:09 Antoine Pitrou  wrote:
...
On Fri, 12 Oct 2018 08:33:32 -0400
Sean Harrington  wrote:
...
Hi Nathaniel - if this solution can be made performant, then I would
be more than satisfied.
I think this would require removing "func" from the "task tuple", and
storing the "func" "once per worker" somewhere globally (maybe a class
attribute set post-fork?).
This also has the beneficial outcome of increasing general performance of
Pool.map and friends. I've seen MANY folks across the interwebs doing
things like passing instance methods to map, resulting in "big" tasks, and
slower-than-sequential parallelized code. Parallelizing "instance methods"
by passing them to map, w/o needing to wrangle with staticmethods and
globals, would be a GREAT feature! It'd be just as easy as:
Pool.map(self.func, ls)
What do you think about this idea? This is something I'd be able to take
on, assuming I get a few core dev blessings...
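
To illustrate why passing an instance method produces "big" tasks: pickling a bound method pickles the instance it is bound to. A minimal sketch follows, with a hypothetical `Model` class whose `weights` list stands in for large instance state.

```python
import pickle


class Model:
    def __init__(self, n):
        # Stand-in for large instance state (hypothetical).
        self.weights = list(range(n))

    def score(self, x):
        return x + len(self.weights)


def score_plain(x):
    return x + 1


model = Model(100_000)

# A bound method is pickled together with its instance, so `self` rides along.
bound_size = len(pickle.dumps(model.score))
# A module-level function is pickled by reference (module and name only).
plain_size = len(pickle.dumps(score_plain))

print(f"bound method: {bound_size} bytes, plain function: {plain_size} bytes")
```

Every task that carries `model.score` pays the larger cost again, which is where the slower-than-sequential behavior tends to come from.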
Well, I'm not sure how it would work, so it's difficult to give an
opinion.  How do you plan to avoid passing "self"?  By caching (by
equality? by identity?)?  Something else?  But what happens if "self"
changed value (in the case of a mutable object) in the parent?  Do you
keep using the stale version in the child?  That would break
compatibility...
I was just suggesting that within a single call to Pool.map, it would be a
reasonable optimization to only send the fn once to each worker. So e.g. if
you have 5 workers and 1000 items, you'd only pickle fn 5 times, rather
than 1000 times like we do now. I wouldn't want to get any fancier than
that with caching data between different map calls or anything.
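
The arithmetic can be sketched as below, assuming fn is pickled once per task. `fn_pickle_counts` and the 1 MB figure are hypothetical; the `chunksize` argument reflects that Pool.map batches items into tasks, so the per-task count drops when chunksize is larger than 1.

```python
import math


def fn_pickle_counts(n_items, n_workers, chunksize=1):
    """Return (per_task, per_worker): how often fn is pickled under each scheme."""
    per_task = math.ceil(n_items / chunksize)  # one pickle per submitted task
    per_worker = n_workers                     # one pickle per worker
    return per_task, per_worker


per_task, per_worker = fn_pickle_counts(n_items=1000, n_workers=5)

fn_bytes = 1_000_000  # hypothetical pickled size of fn
print(f"per task:   {per_task} pickles, {per_task * fn_bytes} bytes")
print(f"per worker: {per_worker} pickles, {per_worker * fn_bytes} bytes")
```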
Of course even this may turn out to be too complicated to implement in a
reasonable way, since it would require managing some extra state on the
workers. But semantically it would be purely an optimization of current
semantics.
-n
...
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
https://mail.python.org/mailman/options/python-dev/seanharr11%40gmail.com

Michael Selik