[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Sean Harrington seanharr11 at gmail.com
Thu Oct 4 05:55:29 EDT 2018

Pool.starmap will serialize and deserialize the “big object” once for each
task created, so it does not perform well here. The goal is to pay the
one-time cost of serializing the “big object”, and still pass the object to
func at each iteration.
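
For a rough sense of that cost, the argument tuples can be pickled by hand,
without starting a pool. This is only a sketch: it assumes Pool pickles each
chunk of (x, big_cache) pairs as one unit, and payload_bytes is a made-up
helper, not part of multiprocessing.

```python
import pickle
from itertools import repeat

big_cache = {str(k): k for k in range(10000)}
ls = list(range(1000))

# Size of the cache when it is pickled exactly once.
cache_bytes = len(pickle.dumps(big_cache))


def payload_bytes(chunksize):
    """Total bytes pickled if (x, big_cache) pairs are sent in chunks.

    Hypothetical helper: each chunk is pickled with its own dumps() call,
    which is how this sketch assumes tasks reach the workers.
    """
    pairs = list(zip(ls, repeat(big_cache)))
    chunks = [tuple(pairs[i:i + chunksize])
              for i in range(0, len(pairs), chunksize)]
    return sum(len(pickle.dumps(chunk)) for chunk in chunks)


per_item = payload_bytes(1)     # the cache is pickled again for every item
per_chunk = payload_bytes(100)  # the cache is pickled once per chunk of 100

print(cache_bytes, per_item, per_chunk)
```

With a chunk size of 1 the cache is serialized once per item. Larger chunks
amortize it, since pickle stores a repeated object only once per dumps()
call, but the cost still grows with the number of chunks and is not paid
just once.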

On Thu, Oct 4, 2018 at 4:14 AM Michael Selik <mike at selik.org> wrote:

> You don't like using Pool.starmap and itertools.repeat or a comprehension
> that repeats an object?
>
> On Wed, Oct 3, 2018, 6:30 PM Sean Harrington <seanharr11 at gmail.com> wrote:
>> Hi guys -
>>
>> Lazily initializing an expensive object in the worker process (i.e. via
>> @lru_cache) is a great solution (one that I must admit I did not think
>> of). Additionally, for the second use case of "*passing a large object to
>> each worker process*", I also agree that your suggestion to "shelter
>> functions in a different module to avoid exposure to globals" is a good
>> solution if one is wary of globals.
>>
>> That said, I still think "*passing a large object from parent process to
>> worker processes*" should be easier when using Pool. Would either of you
>> be open to something like the following?
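>>
>>     from multiprocessing import Pool
>>
>>     def func(x, big_cache=None):
>>         return big_cache[x]
>>
>>     big_cache = {str(k): k for k in range(10000)}
>>     ls = [str(i) for i in range(1000)]  # keys of big_cache are strings
>>
>>     # func_kwargs is the proposed new argument; Pool does not accept it today
>>     with Pool(func_kwargs={"big_cache": big_cache}) as pool:
>>         pool.map(func, ls)
>>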
>> It's a much cleaner interface (which presumably requires a more difficult
>> implementation) than my initial proposal. It also reads a lot better than
>> the "initializer + global" recipe (clear flow of data), and is less
>> constraining than the "define globals in parent" recipe. Most importantly,
>> when taking sequential code and parallelizing it via Pool.map, this does not
>> force the user to re-implement "func" so that it consumes a global
>> (rather than a kwarg). It allows "func" to be used elsewhere (e.g. in the
>> parent process, from a different module, or in tests without globals).
>>
>> This would essentially be an efficient implementation of Pool.starmap(),
>> where the kwargs are static and passed to each application of "func" over
>> our iterable.
>>
>> Thoughts?
>> On Sat, Sep 29, 2018 at 3:00 PM Michael Selik <mike at selik.org> wrote:
>>> On Sat, Sep 29, 2018 at 5:24 AM Sean Harrington <seanharr11 at gmail.com> wrote:
>>> >> On Fri, Sep 28, 2018 at 4:39 PM Sean Harrington <seanharr11 at gmail.com> wrote:
>>> >> > My simple argument is that the developer should not be constrained
>>> >> > to make the objects passed globally available in the process, as this MAY
>>> >> > break encapsulation for large projects.
>>> >>
>>> >> I could imagine someone switching from Pool to ThreadPool and getting
>>> >> into trouble, but in my mind using threads is caveat emptor. Are you
>>> >> worried about breaking encapsulation in a different scenario?
>>> >
>>> > Without a specific example on-hand, you could imagine a tree of
>>> > function calls that occur in the worker process (even newly created
>>> > objects), that should not necessarily have access to objects passed from
>>> > parent -> worker. In every case given the current implementation, they will.
>>>
>>> Echoing Antoine: If you want some functions to not have access to a
>>> module's globals, you can put those functions in a different module.
>>> Note that multiprocessing already encapsulates each subprocess's
>>> globals in essentially a separate namespace.
>>>
>>> Without a specific example, this discussion is going to go around in
>>> circles. You have a clear aversion to globals. Antoine and I do not.
>>> No one else seems to have found this conversation interesting enough
>>> to participate, yet.