[Python-ideas] fork

Sven R. Kunze srkunze at mail.de
Sat Aug 1 19:36:27 CEST 2015


Thanks everybody for the feedback on 'fork'.

Let me address the issues and specify it further:

1) Process vs. Thread vs. Coroutine

As I understand it, the main fallacy here is the assumption that the caller is able to decide which type of pool is best suited.

Take create_thumbnail as an example. As the caller, you do not know whether it is cpu-bound or io-bound; you can only guess or try it out.

But who knows then? I would say: the callee.

create_thumbnail is cpu-bound when doing the work itself on the machine.
create_thumbnail is io-bound when delegating the work to, say, a web service.

SAME FUNCTIONALITY, SAME NAME, SAME API, DIFFERENT POOLS REQUIRED.

This said, I would propose something like a marking solution:

@cpu_bound
def create_thumbnail(image):
    ...  # impl

@io_bound
def create_thumbnail(image):
    ...  # impl

(coroutines are already marked as such)

From this, the Python interpreter should be able to infer which type of pool is appropriate.
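Such markers could be emulated today on top of concurrent.futures. A minimal sketch, assuming the decorator names cpu_bound and io_bound from the proposal (they are not an existing API) and a toy create_thumbnail as a stand-in for real work:

```python
# Sketch: emulate the proposed @cpu_bound / @io_bound markers with
# concurrent.futures. The decorator names are the proposal's, not a real
# API; the pool choice per decorator is the point being illustrated.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import functools

_process_pool = None  # created lazily; processes suit cpu-bound work
_thread_pool = None   # created lazily; threads suit io-bound work

def cpu_bound(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _process_pool
        if _process_pool is None:
            _process_pool = ProcessPoolExecutor()
        return _process_pool.submit(func, *args, **kwargs)
    return wrapper

def io_bound(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _thread_pool
        if _thread_pool is None:
            _thread_pool = ThreadPoolExecutor()
        return _thread_pool.submit(func, *args, **kwargs)
    return wrapper

@io_bound
def create_thumbnail(image):
    return "thumb:" + image  # placeholder for real thumbnail work

future = create_thumbnail("cat.png")
print(future.result())  # blocks until the delegated work finishes
```

The caller's code is identical either way; only the decorator (chosen by the callee) determines which pool does the work.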

2) Pool size

Do lists have a fixed length? Do I need to define their lengths right from the start? Do I know them in advance?

I think the answers to these questions are obvious. I don't understand why it should be different for the size of the pools. They could grow and shrink depending on the workload and the available resources.

3) Pool Management in General

There is a reason why I hesitate to manage pools explicitly. Our code runs on a plethora of platforms, ranging from a few to many hardware threads. We do not want to bake platform-specific properties right into the source. The point of having parallelism and concurrency is to squeeze more out of the machines and get better response times. Anything else wouldn't be honest, in my opinion (aside from research and experimentation).

Thus, a practical solution needs to be simple and universal. Explicitly setting the size of the pool is not universal and definitely not easy.

It doesn't need to be perfect. Even if a first draft implementation simply defined pools of exactly 4 processes/threads/coroutines, that would be awesome. Even cutting execution time in half would be an amazing accomplishment.

Maybe even 'fork' is too complicated. Given the decorators above, it could work without the keyword. But then we could not decide whether to run things in parallel or sequentially, and I don't think I like that.

4) Keyword 'fork'

Well, first shot. If you have a better one, I am all in for it (4 letters or shorter only ;) )... Or maybe something like 'par' for parallel or 'con' for concurrent.

5) Awaiting the Completion of Something

As Andrew proposed, using the return value should result in blocking.

What if there is no result to wait for?
That one is harder, but I think another keyword like 'wait' or 'await' would work fine here:

for image in images:
    fork create_thumbnail(image)
wait
print(get_size_of_thumbnail_dir())
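The fork/wait loop above can be approximated today with an executor. A sketch, where plain functions stand in for the proposed keywords and create_thumbnail/images are toy stand-ins:

```python
# Sketch: 'fork' becomes Executor.submit, 'wait' becomes
# concurrent.futures.wait over the collected futures.
from concurrent.futures import ThreadPoolExecutor, wait as wait_all

images = ["a.png", "b.png", "c.png"]
done = []

def create_thumbnail(image):
    done.append(image)  # placeholder for real thumbnail work

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(create_thumbnail, img) for img in images]  # 'fork'
    wait_all(futures)  # 'wait': block until every forked task completes

print(len(done))  # every thumbnail exists before we measure the directory
```

The proposal's point is that the submit/collect/wait bookkeeping would disappear behind the two keywords.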

6) Exceptions

As close to sequential execution as possible.

That is, when some function is forked out and raises an exception, it should behave as if it were a normal function call.

for image in images:
    fork create_thumbnail(image)  # I would like to see that in my stacktrace

The same goes for expressions. '+=' might raise an exception because, say, huge_calculation returns 'None'. Although the actual evaluation of the sum only needs to take place at the print statement, I would like to see the exception raised at the highlighted place:

end_result = 0
for items in items_list:
    end_result += fork huge_calculation(items)  # stacktrace for '+=' should be here
print(end_result) # not here
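For comparison: with today's futures, the worker's result (or exception) surfaces wherever .result() is consumed. If consumption happens eagerly in the '+=' itself, the traceback already lands where argued above; only a lazy result would defer it to the print. A sketch, with huge_calculation as a stand-in that returns None as described:

```python
# Sketch: a Future surfaces failures at the consumption point. Here the
# worker succeeds but returns None, so the TypeError is raised by '+='
# in the main thread -- exactly the line the traceback should point at.
from concurrent.futures import ThreadPoolExecutor

def huge_calculation(items):
    return None  # simulating the failure mode described above

end_result = 0
with ThreadPoolExecutor() as pool:
    future = pool.submit(huge_calculation, [1, 2, 3])
    try:
        end_result += future.result()  # TypeError raised here, at the '+='
    except TypeError as exc:
        print("caught:", exc)
```

A lazily evaluated fork expression would need to capture the call site so a deferred exception could still point back to the '+=' line.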

Best,
Sven


