[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Sean Harrington seanharr11 at gmail.com
Thu Oct 18 11:35:16 EDT 2018

You have summarized my intentions correctly, and I agree with your
reasoning and concern. However, there is a reasonable answer as to why
this optimization has never been implemented:

In Pool, the `task` tuple consists of (result_job, i, func, (x,), {}). This
is the object that is serialized/deserialized between processes. Confusingly,
the part we care about here is the tuple `(x,)`, not `func`: `func` is
ACTUALLY either mapstar() or starmapstar(), which is called with (x,) as its
*args, and `x` is itself a tuple of (func, iterable), where this inner
`func` is the user's function. Because we need to limit the size of the
`iterable` bundled in each task to avoid de/serialization slowness, we
usually end up with multiple tasks per worker, and thus multiple copies of
the user's `func` sent to each worker. Caching `func` is therefore really
only an optimization in the case of really big functions/closures/partials
(or REALLY big iterables with an unreasonably small chunksize passed to
map()). The most common use case comes up when passing instance methods (of
really big objects!) to Pool.map().
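
To make the cost concrete, here is a minimal sketch (not Pool's actual
code) that estimates how many bytes get pickled when every chunk carries
its own copy of `func`. The names `payload`, `big_func`, `small_func` and
`bytes_shipped` are hypothetical, and a partial over a builtin stands in
for an instance method of a really big object:

```python
import math
import operator
import pickle
from functools import partial

# Hypothetical stand-in for a "really big" callable: 1 MB of state bound
# to a builtin, so the whole partial is picklable.
payload = bytes(1_000_000)
big_func = partial(operator.contains, payload)  # big_func(x) -> x in payload
small_func = operator.neg


def bytes_shipped(func, items, chunksize):
    """Estimate bytes pickled for a map() if every chunk carries func."""
    total = 0
    for start in range(0, len(items), chunksize):
        chunk = tuple(items[start:start + chunksize])
        # Mirrors the (func, iterable) pair described above; the real
        # task tuple carries a few more small fields.
        total += len(pickle.dumps((func, chunk)))
    return total


items = list(range(100))
chunksize = 10
n_tasks = math.ceil(len(items) / chunksize)

big_total = bytes_shipped(big_func, items, chunksize)
small_total = bytes_shipped(small_func, items, chunksize)
print(n_tasks, big_total, small_total)
```

With 100 items and chunksize=10 the 1 MB partial is pickled once per
chunk, so the total grows with the number of tasks, while the small
function adds almost nothing.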

This post colors in the above with more details.

Further, let me pivot on my idea of __qualname__... we can use the `id` of
`func` as the cache key to address your concern, and store this `id` on the
`task` tuple (i.e. an integer in lieu of the `func` previously stored

On Thu, Oct 18, 2018 at 12:49 AM Michael Selik <michael.selik at gmail.com>
wrote:

> If imap_unordered is currently re-pickling and sending func each time it's
> called on the worker, I have to suspect there was some reason to do that
> and not cache it after the first call. Rather than assuming that's an
> opportunity for an optimization, I'd want to be certain it won't have edge
> case negative effects.
>
> On Tue, Oct 16, 2018 at 2:53 PM Sean Harrington <seanharr11 at gmail.com>
> wrote:
>
>> Is your concern something like the following?
>>
>> with Pool(8) as p:
>>     gen = p.imap_unordered(func, ls)
>>     first_elem = next(gen)
>>     p.apply_async(long_func, (x,))
>>     remaining_elems = [elem for elem in gen]
>
> My concern was passing the same function (or a function with the same
> qualname). You're suggesting caching functions and identifying them by
> qualname to avoid re-pickling a large stateful object that's shoved into
> the function's defaults or closure. Is that a correct summary?
>
> If so, how would the function cache distinguish between two functions with
> the same name? Would it need to examine the defaults and closure as well?
> If so, that means it's pickling the second one anyway, so there's no
> efficiency gain.
>
> In [1]: def foo(a):
>    ...:     def bar():
>    ...:         print(a)
>    ...:     return bar
> In [2]: f = foo(1)
> In [3]: g = foo(2)
> In [4]: f
> Out[4]: <function __main__.foo.<locals>.bar()>
> In [5]: g
> Out[5]: <function __main__.foo.<locals>.bar()>
>
> If we say pool.apply_async(f) and pool.apply_async(g), would you want the
> latter one to avoid serialization, letting the worker make a second call
> with the first function object?
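
The closure example above can be checked directly. This is a minimal
sketch, assuming two hypothetical dict-based caches (`by_qualname` and
`by_id`) that stand in for whatever a worker might keep; neither is part
of multiprocessing:

```python
def foo(a):
    def bar():
        return a
    return bar


f = foo(1)
g = foo(2)

# Same qualname, different closure contents, different identities.
same_qualname = f.__qualname__ == g.__qualname__
distinct_ids = id(f) != id(g)

# Hypothetical qualname-keyed cache: the lookup for g wrongly returns f.
by_qualname = {}
by_qualname.setdefault(f.__qualname__, f)
qualname_hit = by_qualname.setdefault(g.__qualname__, g)

# Hypothetical id-keyed cache: f and g get separate entries.
# Caveat: id() is only guaranteed unique among objects that are alive
# at the same time, and only within one process.
by_id = {}
by_id.setdefault(id(f), f)
id_hit = by_id.setdefault(id(g), g)

print(same_qualname, distinct_ids, qualname_hit(), id_hit())
```

Keyed by qualname, the second lookup hands back `f` and the call returns
the wrong value for `g`; keyed by `id`, the two functions stay distinct
for as long as both objects are alive in the parent process.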