[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Mon Oct 22 15:12:45 EDT 2018

On Mon, Oct 22, 2018 at 2:01 PM Michael Selik <mike at selik.org> wrote:

> This thread seems more appropriate for python-ideas than python-dev.
>
>

> On Mon, Oct 22, 2018 at 5:28 AM Sean Harrington <seanharr11 at gmail.com>
> wrote:
>
>> Michael - the initializer/globals pattern still might be necessary if you
>> need to create an object AFTER a worker process has been instantiated (e.g.
>> a database connection).
>>
>
> You said you wanted to avoid the initializer/globals pattern and have such
> things as database connections in the defaults or closure of the
> task-function, or the bound instance, no? Did I misunderstand?
>
>
>> Further, the user may want to access all of the niceties of Pool, like
>> imap, imap_unordered, etc.  The goal (IMO) would be to preserve an
>> interface which many Python users have grown accustomed to, and to allow
>> them to access this optimization out of the box.
>>
>
> You just said that the dominant use-case was mapping a single
> task-function. It sounds like we're talking past each other in some way.
> It'll help to have a concrete example of a case that satisfies all the
> characteristics you've described: (1) no globals used for communication
> between initializer and task-functions; (2) single task-function, mapped
> once; (3) an instance-method as task-function, causing a large
> serialization burden; and (4) did I miss anything?
>

You're right, it's really only characteristics (2) and (3) that define this
spec. However, the case for subclassing really boils down to the "free"
inheritance of the public methods of Pool (map, imap, imap_unordered,
etc.).  Why exclude these (by implementing "procmap()") if we get such a
great return for so little investment?
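
To make the serialization burden in (3) concrete, here is a rough sketch.
The `Model` class and its payload size are made up for illustration and are
not taken from my PR. It only measures pickle sizes, with no worker
processes involved: pickling a bound method pickles the whole instance it is
bound to, and as far as I understand the current implementation, Pool
pickles the task function along with every chunk of work it sends to a
worker.

```python
import pickle


class Model:
    """Hypothetical stand-in for an object with a large memory footprint."""

    def __init__(self, n_bytes):
        self.payload = bytes(n_bytes)  # pretend these are model weights

    def predict(self, x):
        return x * 2


model = Model(1_000_000)

# A bound method pickles as "getattr(instance, name)", so the instance
# (payload included) is serialized along with it.
method_size = len(pickle.dumps(model.predict))

# The argument for a single call is tiny by comparison.
arg_size = len(pickle.dumps(1))

print("bound method:", method_size, "bytes")
print("argument:    ", arg_size, "bytes")
```

So when the task function is `model.predict`, the cost of each task sent to
a worker is dominated by the object, not by the data being mapped over.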

>
>
>
>> Having talked to folks at the Boston Python meetup, folks on my dev team,
>> and perusing Stack Overflow, this "instance method parallelization" is a
>> pretty common pattern that is often a negative return on investment
>> for the developer, due to the implicit implementation detail of pickling
>> the function (and object) once per task.
>>
>
> I believe you.
>
>
>> Is anyone open to reviewing a PR concerning this optimization of Pool,
>> delivered as a subclass? This feature restricts the number of unique tasks
>> being executed by workers at once to one, while allowing aggressive
>> subprocess-level function caching to prevent repeated
>> serialization/deserialization of large functions/closures. The use case is
>> one where the user only ever needs one call to Pool.map(func, ls) (or
>> friends) executing at once, and `func` has a non-trivial memory footprint.
>>
>
> You're quite eager to have this PR merged. I understand that. However,
> it's reasonable to take some time to discuss the design of what you're
> proposing. You don't need it in the stdlib to get your own work done, nor
> to share it with others.
>

I am just eager to solve this problem, which is likely evident, given that
this is the third implementation discussed in detail since my initial PR.
If the group consensus is that this is best implemented via a "procmap"
function in a GitHub gist, then the idea will live there, and will likely
have a lonely life there.

I contend that multiprocessing.Pool is used most frequently with a single
task. I am proposing a feature that enforces this invariant, reduces the
per-task memory footprint (and thus serialization time), and preserves the
well-established interface to Pool through subclassing.