[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Wed Oct 3 21:30:21 EDT 2018

Hi guys -

The solution to "lazily initialize" an expensive object in the worker
process (i.e. via @lru_cache) is a great solution (that I must admit I did
not think of). Additionally, in the second use case of "*passing a large
object to each worker process*", I also agree with your suggestion to
"shelter functions in a different module to avoid exposure to globals" as a
good solution if one is wary of globals.

That said, I still think "*passing a large object from parent process to
worker processes*" should be easier when using Pool. Would either of you be
open to something like the following?

           def func(x, big_cache=None):
               return big_cache[x]

           big_cache =  { str(k): k for k in range(10000) }

           ls = [ i for i in range(1000) ]

with Pool(func_kwargs={"big_cache": big_cache}) as pool:

    pool.map(func, ls)

It's a much cleaner interface (which presumably requires a more difficult
implementation) than my initial proposal. This also reads a lot better than
the "initializer + global" recipe (clear flow of data), and is less
constraining than the "define globals in parent" recipe. Most importantly,
when taking sequential code and parallelizing via Pool.map, this does not
force the user to re-implement "func" such that it consumes a global
(rather than a kwarg). It allows "func" to be used elsewhere (i.e. in the
parent process, from a different module, testing w/o globals, etc...)..

This would essentially be an efficient implementation of Pool.starmap(),
where kwargs are static, and passed to each application of "func" over our
iterable.

Thoughts?

On Sat, Sep 29, 2018 at 3:00 PM Michael Selik <mike at selik.org> wrote:

> On Sat, Sep 29, 2018 at 5:24 AM Sean Harrington <seanharr11 at gmail.com>
> wrote:
> >> On Fri, Sep 28, 2018 at 4:39 PM Sean Harrington <seanharr11 at gmail.com>
> wrote:
> >> > My simple argument is that the developer should not be constrained to
> make the objects passed globally available in the process, as this MAY
> break encapsulation for large projects.
> >>
> >> I could imagine someone switching from Pool to ThreadPool and getting
> >> into trouble, but in my mind using threads is caveat emptor. Are you
> >> worried about breaking encapsulation in a different scenario?
> >
> > >> Without a specific example on-hand, you could imagine a tree of
> function calls that occur in the worker process (even newly created
> objects), that should not necessarily have access to objects passed from
> parent -> worker. In every case given the current implementation, they will.
>
> Echoing Antoine: If you want some functions to not have access to a
> module's globals, you can put those functions in a different module.
> Note that multiprocessing already encapsulates each subprocesses'
> globals in essentially a separate namespace.
>
> Without a specific example, this discussion is going to go around in
> circles. You have a clear aversion to globals. Antoine and I do not.
> No one else seems to have found this conversation interesting enough
> to participate, yet.

>>>

>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20181003/d162a147/attachment.html>