[Python-ideas] fork

Andrew Barnert abarnert at yahoo.com
Wed Jul 29 19:00:42 CEST 2015

On Jul 29, 2015, at 18:46, Sven R. Kunze <srkunze at mail.de> wrote:
> Hi everybody,
> Well, during the discussion of the concurrency capabilities of Python, I found this article worth reading: http://chriskiehl.com/article/parallelism-in-one-line/ His statement "Mmm.. Smell those Java roots." basically sums the whole topic up for me.
> That is sequential code (almost plain English):
> for image in images: 
>     create_thumbnail(image)
> To get started with parallelism and concurrency, we need to do the following:
> pool = Pool()
> pool.map(create_thumbnail, images)
> pool.close() 
> pool.join()

No you don't, because this isn't Java:

    with Pool() as pool:
        pool.map(create_thumbnail, images)

Also note that if create_thumbnail returns a value, assembling all the values into a list is just as simple:

    with Pool() as pool:
        thumbnails = pool.map(create_thumbnail, images)
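
For anyone who wants to run this, here is a minimal sketch. The create_thumbnail body and the images list are hypothetical stand-ins, and the sketch assumes the thread-backed multiprocessing.pool.ThreadPool so that it runs without a `__main__` guard. multiprocessing.Pool has the same interface if you want processes instead.

```python
from multiprocessing.pool import ThreadPool as Pool


def create_thumbnail(image):
    # Hypothetical stand-in: a real version would open and resize the file.
    return f"thumb_{image}"


# Hypothetical input; any iterable of picklable items works the same way.
images = ["a.png", "b.png", "c.png"]

# map() blocks until every item is processed and returns the results
# in the same order as the inputs.
with Pool() as pool:
    thumbnails = pool.map(create_thumbnail, images)

print(thumbnails)
```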

And if you want to iterate over the thumbnails as they are created and don't care about the order, you can just replace the map with imap_unordered. (Or, of course, you can use an executor and just iterate as_completed on the list of futures.)
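
A sketch of the imap_unordered variant, under the same assumptions as before: create_thumbnail and images are hypothetical stand-ins, and a thread-backed pool is used so the snippet runs as is.

```python
from multiprocessing.pool import ThreadPool


def create_thumbnail(image):
    # Hypothetical stand-in for the real image work.
    return f"thumb_{image}"


images = [f"img{i}.png" for i in range(5)]

finished = []
with ThreadPool() as pool:
    # Results are yielded as soon as each worker finishes, so the order
    # here may differ from the order of `images`.
    for thumbnail in pool.imap_unordered(create_thumbnail, images):
        finished.append(thumbnail)

print(finished)
```

The loop has to sit inside the with block, because the results are produced lazily while the pool is still alive.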

> Not bad (considering the other approaches), but why couldn't it look just like the sequential one, maybe like this:
> for image in images: 
>     fork create_thumbnail(image)

To me, this strongly implies that you're actually forking a new child process (or at least a new thread) for every thumbnail. Which is probably a really bad idea if you have, say, 1000 of them. It definitely doesn't say "find/create some implicit pool somewhere, wrap this in a task, and submit it to the pool".

And I'm not sure I'd want it to. What if I want to use a process pool instead of a thread pool, or to use a pool of 12 threads instead of NCPU because I know I'm mostly waiting on a particular HTTP server farm and 12x concurrency is ideal?
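
To make that concrete, here is a sketch of the explicit choice using concurrent.futures. The function fetch_and_thumbnail and the URLs are hypothetical, and 12 is just the figure from the paragraph above, not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_and_thumbnail(url):
    # Hypothetical I/O-bound task: a real version would download the image
    # from the server farm and then resize it.
    return f"thumb_{url.rsplit('/', 1)[-1]}"


# Hypothetical URLs; example.invalid never resolves.
urls = [f"http://example.invalid/img{i}.png" for i in range(30)]

# The pool type and size are explicit here. For CPU-bound work you could
# swap in ProcessPoolExecutor (with an `if __name__ == "__main__":` guard).
with ThreadPoolExecutor(max_workers=12) as executor:
    thumbnails = list(executor.map(fetch_and_thumbnail, urls))

print(len(thumbnails))
```

With an implicit pool behind a fork keyword, there would be no obvious place to put either of those two decisions.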

Also, how would you extend this to return results? A statement can't have a result. And, even if this were an expression, it would look pretty ugly to do this:

    thumbnails = []
    for image in images:
        thumbnails.append(fork create_thumbnail(image))

Or this:

    thumbnails = [fork create_thumbnail(image) for image in images]

And, even if you liked the look of that, what exactly could thumbnails be? Obviously not a list of thumbnails. At best, a list of futures that you'd then still have to loop over with as_completed or similar.
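
Today's closest equivalent shows the same thing. In this sketch (create_thumbnail and images are hypothetical stand-ins), the comprehension yields Future objects, and a second pass is still needed to get at the thumbnails.

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed


def create_thumbnail(image):
    # Hypothetical stand-in for the real image work.
    return f"thumb_{image}"


images = ["a.png", "b.png", "c.png"]

with ThreadPoolExecutor() as executor:
    # This is roughly what "[fork create_thumbnail(image) for ...]" could
    # give you: a list of futures, not a list of thumbnails.
    pending = [executor.submit(create_thumbnail, image) for image in images]
    all_are_futures = all(isinstance(f, Future) for f in pending)

    # The second loop is still required; completion order is not guaranteed.
    thumbnails = [future.result() for future in as_completed(pending)]

print(all_are_futures, sorted(thumbnails))
```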

Of course you could design a new language with implicit futures built into the core (or, even better, a two-level variable store with dataflow variables and implicit blocking) to solve this, but it would be very different from Python semantics.

> What I like about the Pool concept is that it frees me of thinking about the interprocess/-thread communication and processes/threads management (not sure how this works with coroutines, but the experts among you will know).
> What I would like to be freed of as well is: pool management. It actually reminds me of languages without garbage collection.

That's a good parallel--but that's exactly what's so nice about "with Pool() as pool:". When you need a pool to be deterministically managed, this is the nicest syntax in any language to do it (except maybe C++ with its RAII, which lets you hide deterministic destruction inside wrapper objects). It's hard to see how it could be any more minimal. After all, if you don't wait on the pool to finish, and you don't collect a bunch of futures to wait on, how do you know when all the thumbnails are created?
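
As a sketch of what the with block buys you (create_thumbnail and images are again hypothetical stand-ins): with a concurrent.futures executor, leaving the block calls shutdown(wait=True), so every submitted task has finished by the next line.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def create_thumbnail(image):
    # Hypothetical stand-in; the short sleep simulates real work.
    time.sleep(0.01)
    return f"thumb_{image}"


images = ["a.png", "b.png", "c.png", "d.png"]

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(create_thumbnail, image) for image in images]

# Leaving the with block waited for all pending work, so this is the one
# well-defined point where all the thumbnails are known to exist.
all_done = all(future.done() for future in futures)
results = [future.result() for future in futures]

print(all_done, results)
```

Note that multiprocessing.Pool behaves a little differently: its context manager calls terminate() on exit, so there it is the blocking map call inside the block that does the waiting.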