You don't like using Pool.starmap and itertools.repeat or a comprehension that repeats an object?
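Roughly like this, I mean (just a sketch, reusing your big_cache example):

    from itertools import repeat
    from multiprocessing import Pool

    def func(x, big_cache):
        return big_cache[x]

    if __name__ == "__main__":
        big_cache = {str(k): k for k in range(10000)}
        ls = [str(i) for i in range(1000)]
        with Pool() as pool:
            # starmap unpacks each (x, big_cache) pair into func's arguments;
            # repeat() supplies the same cache object for every call.
            results = pool.starmap(func, zip(ls, repeat(big_cache)))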


On Wed, Oct 3, 2018, 6:30 PM Sean Harrington <seanharr11@gmail.com> wrote:
Hi guys -

Lazily initializing an expensive object in the worker process (e.g. via @lru_cache) is a great solution (and one that I must admit I did not think of). Additionally, for the second use case of "passing a large object to each worker process", I agree that your suggestion to "shelter functions in a different module to avoid exposure to globals" is a good solution if one is wary of globals.
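(Roughly what I understand that lazy-initialization recipe to look like, with a dict standing in for the expensive object:)

    from functools import lru_cache
    from multiprocessing import Pool

    @lru_cache(maxsize=None)
    def get_big_cache():
        # Built at most once per worker process, on first use.
        return {str(k): k for k in range(10000)}

    def func(x):
        return get_big_cache()[x]

    if __name__ == "__main__":
        with Pool() as pool:
            results = pool.map(func, [str(i) for i in range(1000)])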

That said, I still think "passing a large object from parent process to worker processes" should be easier when using Pool. Would either of you be open to something like the following?

    def func(x, big_cache=None):
        return big_cache[x]

    big_cache = {str(k): k for k in range(10000)}
    ls = [str(i) for i in range(1000)]

    with Pool(func_kwargs={"big_cache": big_cache}) as pool:
        pool.map(func, ls)

It's a much cleaner interface (which presumably requires a more difficult implementation) than my initial proposal. It also reads a lot better than the "initializer + global" recipe (the flow of data is clear), and is less constraining than the "define globals in parent" recipe. Most importantly, when taking sequential code and parallelizing via Pool.map, it does not force the user to re-implement "func" such that it consumes a global rather than a kwarg. It allows "func" to be used elsewhere (e.g. in the parent process, from a different module, in tests without globals, etc.).

This would essentially be an efficient implementation of Pool.starmap(), where the kwargs are static and passed to each application of "func" over our iterable.
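(One can approximate this today with functools.partial, though if I understand the Pool internals correctly, the bound object is re-pickled with every chunk of tasks rather than shipped once per worker, which is exactly the overhead I'd like to avoid:)

    from functools import partial
    from multiprocessing import Pool

    def func(x, big_cache=None):
        return big_cache[x]

    if __name__ == "__main__":
        big_cache = {str(k): k for k in range(10000)}
        ls = [str(i) for i in range(1000)]
        with Pool() as pool:
            # partial binds the kwarg, but the bound cache still travels
            # with the pickled function for each chunk of tasks.
            results = pool.map(partial(func, big_cache=big_cache), ls)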

Thoughts?


On Sat, Sep 29, 2018 at 3:00 PM Michael Selik <mike@selik.org> wrote:
On Sat, Sep 29, 2018 at 5:24 AM Sean Harrington <seanharr11@gmail.com> wrote:
>> On Fri, Sep 28, 2018 at 4:39 PM Sean Harrington <seanharr11@gmail.com> wrote:
>>> My simple argument is that the developer should not be constrained to make the objects passed globally available in the process, as this MAY break encapsulation for large projects.
>>
>> I could imagine someone switching from Pool to ThreadPool and getting
>> into trouble, but in my mind using threads is caveat emptor. Are you
>> worried about breaking encapsulation in a different scenario?
>
> Without a specific example on-hand, you could imagine a tree of function calls that occur in the worker process (even newly created objects), that should not necessarily have access to objects passed from parent -> worker. In every case given the current implementation, they will.

Echoing Antoine: If you want some functions to not have access to a
module's globals, you can put those functions in a different module.
Note that multiprocessing already encapsulates each subprocess's
globals in essentially a separate namespace.
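(Roughly what I mean, combining the initializer recipe Sean mentioned with a
separate module; the module and names here are made up for illustration:)

    # workers.py -- only this module's globals are visible to func
    _big_cache = None

    def init(big_cache):
        global _big_cache
        _big_cache = big_cache

    def func(x):
        return _big_cache[x]

    # main.py
    from multiprocessing import Pool
    import workers

    if __name__ == "__main__":
        big_cache = {str(k): k for k in range(10000)}
        with Pool(initializer=workers.init, initargs=(big_cache,)) as pool:
            results = pool.map(workers.func, [str(i) for i in range(1000)])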

Without a specific example, this discussion is going to go around in
circles. You have a clear aversion to globals. Antoine and I do not.
No one else seems to have found this conversation interesting enough
to participate, yet.
