Proposal: A simple protocol for generator tasks

[This is a lengthy mail; I apologize in advance!]

Hi,

I've been following this discussion with great interest, and would like to put forward a suggestion that might simplify some of the questions that are up in the air.

There are several key points being considered: what exactly constitutes a "coroutine" or "tasklet", what the precise semantics of "yield" and "yield from" should be, how the stdlib can support different event loops and reactors, and how exactly Futures, Deferreds, and other APIs fit into the whole picture.

This mail is mostly about the first point: I think everyone agrees roughly what a coroutine-style generator is, but there's enough variation in how they are used, both historically and presently, that the concept isn't as precise as it should be. This makes them hard to think and reason about (failing the "BDFL gets headaches" test), and makes it harder to define the behavior of all the parts that they interact with, too.

This is a sketch of an attempt to define what constitutes a generator-based task or coroutine more rigorously: I think that the essential behavior can be captured in a small protocol, building on the generator and iterator protocols. If anyone else thinks this is a good idea, maybe something like this could work its way into a PEP?

(For the sake of this mail, I will use the term "generator task" or "task" as a straw man term, but feel free to substitute "coroutine", or whatever the preferred name ends up being.)


Definition
==========

Very informally: A "generator task" is what you get if you take a normal Python function and replace its blocking calls with "yield from" calls to equivalent subtasks.

More formally, a "generator task" is a generator that implements an incremental, multi-step computation, and is intended to be externally driven to completion by a runner, or "scheduler", until it delivers a final result.

This driving process happens as follows:

1. A generator task is iterated by its scheduler to yield a series of intermediate "step" values.

2. Each value yielded as a "step" represents a scheduling instruction, or primitive, to be interpreted by the task's scheduler. This scheduling instruction can be None ("just resume this task later"), or a variety of other primitives, such as Futures ("resume this task with the result of this Future"); see below for more.

3. The scheduler is responsible for interpreting each "step" instruction as appropriate, and sending the instruction's result, if any, back to the task using send() or throw(). A scheduler may run a single task to completion, or may multiplex execution between many tasks: generator tasks should assume that other tasks may have executed while the task was yielding.

4. The generator task completes by successfully returning (raising StopIteration), or by raising an exception. The task's caller receives this result.

(For the sake of discussion, I use "the scheduler" to refer to whoever calls the generator task's next/send/throw methods, and "the task's caller" to refer to whoever receives the task's final result, but this is not important to the protocol: a task should not care who drives it or consumes its result, just like an iterator should not.)
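A minimal, blocking sketch of the driving process described above, handling only the None and Future-like instructions; the run_task() name and the add_done_callback duck-check are illustrative assumptions, not part of the proposal:

    def run_task(task):
        """Minimal sketch: drive one generator task to completion, blocking as needed."""
        value, exc = None, None          # what to deliver to the task on the next step
        while True:
            try:
                if exc is not None:
                    step = task.throw(exc)       # deliver a failure into the task
                else:
                    step = task.send(value)      # deliver the previous instruction's result
            except StopIteration as stop:
                return getattr(stop, 'value', None)   # task finished: its return value
            value, exc = None, None
            if step is None:
                continue                         # "just resume this task later"
            elif hasattr(step, 'add_done_callback'):
                try:
                    value = step.result()        # Future-like: a real scheduler would not block here
                except Exception as e:
                    exc = e
            else:
                exc = NotImplementedError("unrecognized instruction: %r" % (step,))

A real scheduler would of course multiplex many such tasks over an event loop instead of blocking on result(), but the send/throw/StopIteration mechanics are the same.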
Scheduling instructions / primitives
====================================

(This could probably use a better name.)

The protocol is intentionally agnostic about the implementation of schedulers, event loops, or reactors: as long as they implement the same set of scheduling primitives, code should work across them.

There are multiple ways to accomplish this, but one possibility is to have a set of common, generic instructions in a standard library module such as "tasklib" (which could also contain things like default scheduler implementations, helper functions, and so on).

A partial list of possible primitives (the names are all made up, not serious suggestions):

1. None: The most basic "do nothing" instruction. This just instructs the scheduler to resume the yielding task later.

2. Futures: Instruct the scheduler to resume with the future's result. Similar types in third-party libraries, such as Deferreds, could potentially be implemented either natively by a scheduler that supports it, or using a wait_for_deferred(d) helper task, or using the idea of an "adapter" scheduler (see below).

3. Control primitives: spawn, sleep, etc.

   - Spawn a new (independent) task: yield tasklib.spawn(task())
   - Wait for multiple tasks: (x, y) = yield tasklib.par(foo(), bar())
   - Delay execution: yield tasklib.sleep(seconds)
   - etc.

   These could be simple marker objects, leaving it up to the underlying scheduler to actually recognize and implement them; some could also be implemented in terms of simpler operations (e.g. sleep(), in terms of lower-level suspend and resume operations).

4. I/O operations

   This could be anything from low-level "yield fd_readable(sock)" style requests, to any of the higher-level APIs being discussed elsewhere. Whatever the exact API ends up being, the scheduler should implement these primitives by waiting for the I/O (or condition), and resuming the task with the result, if any.

5. Cooperative concurrency primitives, for working with locks, condition variables, and so on. (If useful?)

6. Custom, scheduler-specific instructions: Since a generator task can potentially yield anything as a scheduler instruction, it's not inconceivable for specialized schedulers to support specialized instructions. (Code that relies on such special instructions won't work on other schedulers, but that would be the point.)

A question open to debate is what a scheduler should do when faced with an unrecognized scheduling instruction. Raising TypeError or NotImplementedError back into the task is probably a reasonable action, and would allow code like:

    def task():
        try:
            yield fancy_magic_instruction()
        except NotImplementedError:
            yield from boring_fallback()
        ...


Generator tasks as schedulers, and vice versa
=============================================

Note that there is a symmetry to the protocol when a generator task calls another using "yield from":

    def task():
        spam = yield from subtask()

Here, task() is both a generator task, and the effective scheduler for subtask(): it "implements" subtask()'s scheduling instructions by delegating them to its own scheduler.

This is a plain observation on its own, but it raises one or two possibilities for more interesting schedulers implemented as generator tasks themselves, including:

- Specialized sub-schedulers that run as a normal task within their parent scheduler, but implement, for example, weighted or priority queuing of their subtasks, or similar features.

- "Adapter" schedulers that intercept special scheduler instructions (say, Deferreds or other library-specific objects), and implement them using more generic instructions to the underlying scheduler.

--
Piet Delport
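As a concrete illustration of the "simple marker objects" idea for the control primitives above, a hypothetical "tasklib" could contain little more than plain data classes; the names follow the strawman list and are not a real module:

    # Hypothetical "tasklib" marker objects: plain data with no behaviour of their
    # own.  A scheduler recognizes the types and implements them however it likes.

    class spawn:
        """Instruction: start `task` as a new, independent task."""
        def __init__(self, task):
            self.task = task

    class sleep:
        """Instruction: resume the yielding task after roughly `seconds` seconds."""
        def __init__(self, seconds):
            self.seconds = seconds

    def background_job():
        """Placeholder subtask, only for the usage example below."""
        yield  # one "do nothing" step, then finish

    def example_task():
        yield spawn(background_job())   # fire and forget
        yield sleep(2.0)                # cooperative delay

A scheduler's dispatch over instructions then reduces to a chain of isinstance() checks, or a type-to-handler table.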

Piet Delport wrote:
2. Each value yielded as a "step" represents a scheduling instruction, or primitive, to be interpreted by the task's scheduler.
I don't think this technique should be used to communicate with the scheduler, other than *maybe* for a *very* small set of operations that are truly primitive -- and even then I'm not convinced. To begin with, there are some operations that *can't* rely on yielded instructions as the only way of invoking them. Spawning a task, for example -- there must be some way for non-task code to invoke that, otherwise you wouldn't be able to get top-level tasks into the system. Also, consider the operation of unblocking a task that's waiting for some event to occur. Often you will want to invoke this using a callback from an event loop, which is not a generator and can't yield anything to anywhere. Given that these operations must provide a way of invoking them using a plain function call, there is little reason to provide a second way using a yielded instruction. In any case, I believe that the public interface for *any* scheduler operation should not be a yielded instruction, but either a plain function or something called using yield-from, for reasons I explained to Guido earlier.
There are problems with allowing multiple schedulers to coexist within the one system, especially if yielded instructions are the only way to communicate with them. It might work for instructions to a task's own scheduler concerning itself, but some operations need to operate on a *different* task, e.g. unblocking a task when the event it was waiting for occurs. How do you know which scheduler is managing it? And even if you can find out, if you have to control it using yielded instructions, you have no way of yielding something to a different task's scheduler. -- Greg
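A sketch of the kind of plain-callable interface being argued for here; the Scheduler class and its method names are purely illustrative, not a proposed API:

    class Scheduler:
        """Illustrative sketch of scheduler operations exposed as plain calls."""
        def __init__(self):
            self.ready = []                    # (task, value_to_send) pairs ready to run

        def spawn(self, task):
            """Start a new task; callable from anywhere, including non-task code."""
            self.ready.append((task, None))

        def unblock(self, task, value=None):
            """Resume a waiting task; usable directly as an event-loop callback."""
            self.ready.append((task, value))

    # Top-level code gets tasks into the system with a plain call:
    #     sched = Scheduler()
    #     sched.spawn(main())
    # and an event loop can unblock a waiter without yielding anything:
    #     loop.call_when_readable(sock, sched.unblock, waiting_task)   # hypothetical loop API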

On Mon, Oct 15, 2012 at 11:17 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
But this is by necessity how the scheduler is *already* being communicated with, at least for the de facto scheduler instructions like None, Future, and the other primitives being discussed. This concept of an "intermediate object yielded by a task to its scheduler on each step, instructing it how to schedule" is already unavoidably fundamental to how these tasks / coroutines work: this proposal is just an attempt to name that concept, and define it more clearly.
I'm definitely not suggesting that this be the *only* way of invoking operations, or that all operations should be invoked this way. Certainly, everything that is possible inside this protocol will also be possible outside of it by directly calling methods on some global scheduler, but that requires knowing who and what that global scheduler is.

It's important to note that a globally identifiable scheduler object might not even exist: it's entirely reasonable, for example, to implement this entire protocol in Twisted by writing a deferTask(task) helper that handles generic scheduler instructions (None, Future-alike, and things like spawn() and sleep()) by just arranging for the appropriate Twisted callbacks and resumptions to happen under the hood. (This is basically how Twisted's deferredGenerator works currently: the main difference is that a deferTask() implementation would be able to run any generic coroutine / generator task code that uses this protocol, without that code having to know about Twisted.)

Regarding getting top-level tasks into the system, this can be done in a variety of ways, depending on how particular applications are structured. For example, if the stdlib grows a standardized default event loop:

    tasklib.DefaultScheduler(tasks).start()

or:

    result = tasklib.run(task())

or with existing frameworks like Twisted:

    deferTask(task()).addCallback(consume)
    deferTasks(othertasks)
    reactor.start()

In other words, only the top level of an application should need to worry about how the initial scheduler, tasks, and everything else are started.
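A rough sketch of what a deferTask() helper along these lines might look like; the overall structure is an assumption (only Deferred, addCallbacks, and reactor.callLater are real Twisted APIs), and only the None and Deferred instructions are handled:

    from twisted.internet import defer, reactor

    def deferTask(task):
        """Sketch: drive a generator task under Twisted, returning a Deferred result."""
        done = defer.Deferred()

        def step(value=None, exc=None):
            try:
                instruction = task.throw(exc) if exc is not None else task.send(value)
            except StopIteration as stop:
                done.callback(getattr(stop, 'value', None))   # task finished normally
            except Exception as e:
                done.errback(e)                               # task failed
            else:
                if instruction is None:
                    reactor.callLater(0, step)                # just resume later
                elif isinstance(instruction, defer.Deferred):
                    instruction.addCallbacks(step, lambda f: step(exc=f.value))
                else:
                    reactor.callLater(0, step, None,
                                      NotImplementedError(instruction))

        reactor.callLater(0, step)
        return done

Note that no long-lived scheduler object exists here at all: the "scheduler" is just the chain of step() calls arranged through the reactor.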
This can be done with a scheduler primitive that obtains a callable to resume the current task, like the strawman:

    resume = yield tasklib.get_resume()

from the other thread. Whatever the exact API ends up looking like, suspending and resuming tasks are very fundamental operations, and probably the most worth having as standardized instructions that any scheduler can implement: a variety of more powerful abstractions can be generically built on top of them.
I don't see the former as an argument to avoid supporting the same operations as standard yielded instructions. A task can arrange to wait for a Future using plain function calls, or by yielding it as an instruction (i.e., "result = yield some_future()"): the ability to do the former should not make the latter any less desirable.

The advantage of treating certain primitives as yielded scheduler instructions is that:

- It's generic and scheduler-agnostic: for example, any task can simply yield a Future to its scheduler without caring exactly how the scheduler arranges for add_done_callback() to resume the task.

- It requires no global coordination: every generator task already has a direct line of communication to its immediate scheduler, without having to identify itself using handles, task ids, or other mechanisms.

In other words, it's the difference between saying:

    h = get_current_task_handle()
    current_scheduler.sleep(h, 10)
    yield
    current_scheduler.suspend(h)
    yield

and saying:

    yield tasklib.sleep(10)
    yield tasklib.suspend()

where sleep(n) and suspend() are simple generic objects that any scheduler can recognize and implement, just like how yielded None and Future values are recognized and implemented.
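To make "the scheduler arranges for add_done_callback() to resume the task" concrete, the scheduler side of the Future instruction could look roughly like this; resume_with_value() and resume_with_exception() are assumed scheduler internals, not proposed names:

    def handle_future_instruction(scheduler, task, future):
        """Sketch: suspend `task` until `future` completes, then resume it."""
        def on_done(f):
            try:
                result = f.result()
            except Exception as e:
                scheduler.resume_with_exception(task, e)   # assumed scheduler internal
            else:
                scheduler.resume_with_value(task, result)  # assumed scheduler internal
        future.add_done_callback(on_done)                  # concurrent.futures.Future API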
In other words, limiting the allowable set of yielded scheduler instructions to None, and doing everything else with a separate API? This is possible, but it seems like an awful waste of the perfectly good and dedicated communication channel that already exists between tasks and their schedulers, in favor of something more complex and indirect.

There's certainly a motivation for global APIs too, as with the discussion about getting standardized event loops and schedulers into the stdlib, but I think that is solving a somewhat different problem, and I see no reason to tie coroutines / generator tasks to those APIs when a simpler, more generic and universal protocol could be defined.

To me, defining locally how a scheduler should behave and respond to certain yielded types and values is a much more tractable problem than the question of designing a good global scheduler API that exposes all the same operations in a way that's portable and usable across many different application architectures and lifecycles.
The point of a protocol like this is that there would be no need for tasks to know which schedulers are managing what: they can limit themselves to using a generic protocol.

For example, the par() implementation I gave assumes the primitive:

    resume = yield tasklib.get_resume()

to get a callable to resume itself, and can simply pass that callable to the tasks it spawns: the last child to complete just calls resume() to resume the parent task in its own scheduler.

In this example, the resume callable contains all the necessary state to resume that particular task. A particular scheduler could implement this primitive by sending back a closure like:

    lambda: current_scheduler.schedule(the_task)

In the case of something like deferTask(), there need not even be any particular long-lived scheduler aside from the transient calls arranged by deferTask, and all the state would live in the Twisted reactor and its queues:

    lambda: reactor.callLater(_defertask_iterate, the_task)

As far as the generic protocol is concerned, it does not matter whether there's a single global scheduler, or multiple schedulers, or no single scheduler at all: the scheduler side of the protocol is free to be implemented in many ways, and manage its state however it's convenient.
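The par() implementation referred to is from an earlier message and is not reproduced in this thread; a rough reconstruction of the idea it describes, using the strawman get_resume(), spawn(), and suspend() instructions (all hypothetical, and ignoring error propagation and the zero-subtask case), might look like:

    def par(*tasks):
        """Rough sketch: run subtasks concurrently, resume with all their results."""
        resume = yield tasklib.get_resume()      # callable that resumes this task
        results = [None] * len(tasks)
        remaining = [len(tasks)]                 # shared countdown of unfinished children

        def child(i, task):
            results[i] = yield from task         # run the child to completion
            remaining[0] -= 1
            if remaining[0] == 0:
                resume()                         # last child wakes the parent

        for i, task in enumerate(tasks):
            yield tasklib.spawn(child(i, task))  # start each child independently
        yield tasklib.suspend()                  # sleep until the last child resumes us
        return tuple(results)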
Generally speaking, this should not be necessary: inter-task communication is a different question from how tasks should communicate with their immediate scheduler.

Generically controlling the scheduling of different tasks can be done in many ways:

- The way par() passes its resume callable to its spawned children.

- Using synchronization primitives: for example, an alternative way to implement something like par() without direct use of suspend/resume is a cooperative condition variable or semaphore.

- Using queues, channels, or similar mechanisms to communicate information between tasks. (The communicated values can implicitly even be scheduler instructions themselves, like a queue of Futures; see the sketch after this list.)

If something cannot be done inside this generator task protocol, you can of course still step outside of it and use other mechanisms directly, but that necessarily ties your code to those mechanisms, which may not be as simple and universal as code that only relies on this protocol.
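A sketch of the "queue of Futures" idea: the queue itself is ordinary code, and a consumer simply yields the Future it gets back as a scheduler instruction (the class and method names are made up):

    import collections
    from concurrent.futures import Future

    class FutureQueue:
        """Sketch: a queue whose get() hands back a Future to yield to the scheduler."""
        def __init__(self):
            self.items = collections.deque()     # values waiting to be consumed
            self.waiters = collections.deque()   # Futures of consumers waiting for values

        def put(self, item):
            if self.waiters:
                self.waiters.popleft().set_result(item)   # wake one waiting consumer
            else:
                self.items.append(item)

        def get(self):
            """Return a Future that the consumer task yields as an instruction."""
            f = Future()
            if self.items:
                f.set_result(self.items.popleft())        # already-completed Future
            else:
                self.waiters.append(f)                    # completed later by put()
            return f

    def consumer(queue):
        """Consumer task: the scheduler resumes it with each Future's result."""
        while True:
            item = yield queue.get()
            if item is None:
                return                    # None as a simple end-of-stream marker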

On Sun, Oct 14, 2012 at 11:36 PM, Piet Delport <pjdelport@gmail.com> wrote:
[This is a lengthy mail; I apologize in advance!]
This is what I get for deciding to check up on these threads at 6AM after a late night.
I like that "task" is more general and avoids complaints from some that these are not "real" coroutines.
"yield" and "yield from", although I'm really disliking the second being included at all. More on this later.
What is the difference between the tossed around "yield from task()" and this "yield tasklib.spawn(task())"? And, why isn't it simply spelled "yield task()"? You have all these different types that can be yielded from tasks to the scheduler. Why isn't a task one of those possible types? If the scheduler gets an iterator, it should schedule it automatically.
I am sure these will come about, but I think that is considered a library that sits on top of whatever API comes out, not part of it.
Interesting. Can anyone think of an example of this?
As raised above, why not simply "yield subtask()"?
I think that is too messy, you could have so many different scheduler semantics. Maybe this sort of thing is what your scheduler-specific instructions should be for. Or, attributes on tasks that schedulers can be known to look for.
I think we can make yielding tasks a direct operation, and still implement sub-schedulers. They should be more opaque, I think.
-- Read my blog! I depend on your acceptance of my opinion! I am interesting! http://techblog.ironfroggy.com/ Follow me if you're into that sort of thing: http://www.twitter.com/ironfroggy

On Mon, Oct 15, 2012 at 12:48 PM, Calvin Spealman <ironfroggy@gmail.com> wrote:
What is the difference between the tossed around "yield from task()" and this "yield tasklib.spawn(task())"
"yield from task()" is simply the coroutine / task version of a function call: it runs the task to completion, and returns its final result. "yield tasklib.spawn(task())" (or however it ends up being spelled) would be a scheduler primitive to start a task *without* waiting for its result: in other words, it's a request that the scheduler start a new, independent thread of control.
This is a good question: I stopped short of discussing it in the original message only to keep it short, and in the hope that the answer is implied.

The short answer is that "yield task()" is the old, hacky, cumbersome, "legacy"[1] way of calling subtasks, and that "yield from" should entirely replace the need to have to support it.

Before "yield from", "yield task()" was the only way to call subtasks, but this approach has some major disadvantages:

1. In order for it to work, schedulers must manually implement task trampolining, which is ugly at best, and prone to bugs if not all edge cases are handled correctly. (IOW, it effectively places the burden of implementing PEP 380 onto each scheduler.)

2. It obfuscates exception tracebacks by default, requiring schedulers that want readable stack traces to take additional pains to clean up their own non-task frames while propagating exceptions.

3. It requires schedulers to reliably distinguish between tasks and other primitives in the first place. Simply treating all iterators as tasks is not sufficient: to run a task, you need send() and throw(), at least. (Type-checking for GeneratorType would be marginally better, but would unnecessarily preclude, for example, implementing tasks as classes or C extension types, which is otherwise entirely possible with this protocol.)

"yield from" simplifies and solves all these problems in one elegant swoop:

1. No more manual trampolining: a scheduler can treat any task as a single unit, and only needs to worry about the single, combined stream of instructions coming from it.

2. Tracebacks (and return values) take care of themselves, as they should.

3. By separating the concerns of direct scheduler communication ("yield") and subtask delegation ("yield from"), schedulers can limit themselves to just knowing about scheduler primitives when dealing with yielded values, which should be more easily and tightly defined than the full spectrum of tasks in general. (The set of officially-defined scheduler instructions could end up being as small as None and Future, say.)

In summary, it's entirely possible for schedulers to continue supporting the old "yield task()" way of calling subtasks (and this has no problem fitting into the proposed protocol[2]), but there should be no reason to do so, and several good reasons not to: hopefully, it will become a pre-3.3 historical footnote.

[1] For the purposes of this email, interpret "legacy" to mean "older than 17 days". :)

[2] Interpreted as a scheduler instruction, a task value would simply mean "resume the current task with the result of completing the yielded subtask" (modulo the practical question of reliably type-checking tasks, as mentioned).
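To illustrate the trampolining burden in point 1, a scheduler supporting "yield task()" needs roughly the manual call stack below; error propagation via throw() (part of the traceback problem in point 2) is omitted, and interpret_instruction() is an assumed helper:

    def run_with_trampoline(root_task):
        """Sketch of the per-scheduler trampolining that "yield task()" requires."""
        stack = [root_task]          # manually maintained call stack of generators
        value = None                 # value to send into the generator on top of the stack
        while stack:
            try:
                step = stack[-1].send(value)
            except StopIteration as stop:
                stack.pop()          # subtask finished: hand its value back to its caller
                value = getattr(stop, 'value', None)
                continue
            if hasattr(step, 'send') and hasattr(step, 'throw'):
                stack.append(step)   # yielded a subtask: "call" it
                value = None
            else:
                # Anything else is a scheduler instruction (None, Future, ...);
                # interpret_instruction() is an assumed helper, as in earlier sketches.
                value = interpret_instruction(step)
        return value

With "yield from", all of this stack bookkeeping disappears: the scheduler only ever sees the top-level generator and its combined stream of instructions.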
I just want to note for the record that I'm not *encouraging* this kind of thing: I'm just observing that it would be allowed by the protocol. (However, one imaginable use case would be for tasks to send scheduler-specific hints, that can safely be ignored when those tasks are running on other scheduler implementations.)
It shouldn't get messy: the core semantics of any scheduler should always stay within the proposed protocol. The above is not the best example of a custom scheduler, though.

Perhaps a better example would be a generic helper function like the following, that implements throttling of I/O requests made through it:

    def task():
        result = yield from io_throttled(subtask(), rate=foo)

io_throttled() would end up sitting between task() and subtask() in the hierarchy, like so:

    ... -> task() -> io_throttled() -> subtask() -> ...

To recap, each task is implicitly driven by the scheduler above it, and implicitly drives the task(s) below it: the outer scheduler drives task(), which drives io_throttled(), which drives subtask(), and so on.

In this picture, "yield from" is the "most default" scheduler: it simply delegates all yielded instructions to the outer scheduler. However, instead of relying on "yield from", io_throttled() can dip down into the task protocol itself, and drive subtask() directly. This would allow it to inspect and manipulate the underlying instructions and responses flowing back and forth, and, assuming that there's a recognizable standard representation for I/O primitives, it could keep track of the rate of I/O, and insert delay instructions as necessary (or something similar).

The key observations I want to make:

* io_throttled() is not special: it is just a normal task, as far as the tasks above and below it are concerned, and assumes only a recognizable representation of the fundamental I/O and delay instructions used.

* To the extent that said underlying primitives are scheduler-agnostic, io_throttled() can be used or inserted anywhere, without caring how the underlying scheduler or event loop handles I/O, or how its global API looks. It just acts locally, in terms of the task protocol.

An example where this kind of thing might actually be useful is an application or library that wishes to throttle, say, certain HTTP requests: it could simply internally wrap the tasks that make those requests in io_throttled(), without any special support from the underlying scheduler.

This is of course not the only way to solve this particular problem, but it's an example of how thinking about generator tasks and their schedulers as two sides of the same underlying protocol could be a powerful abstraction, enabling a compositional approach to combining implementations of the protocol that might not be obvious or possible otherwise.

--
Piet Delport
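A skeletal sketch of the io_throttled() idea just described: it drives the subtask directly instead of using "yield from", and relays everything to its own scheduler; is_io_instruction() and tasklib.sleep() stand in for the assumed recognizable I/O and delay primitives, and the throttling policy is deliberately crude:

    def io_throttled(task, rate):
        """Sketch: drive `task` directly, inserting delays to limit its I/O rate."""
        value, exc = None, None
        recent_io = 0                                  # I/O instructions since last pause
        while True:
            try:
                step = task.throw(exc) if exc is not None else task.send(value)
            except StopIteration as stop:
                return getattr(stop, 'value', None)    # propagate the subtask's result
            value, exc = None, None
            if is_io_instruction(step):                # assumed: recognizable I/O marker
                recent_io += 1
                if recent_io > rate:
                    yield tasklib.sleep(1.0)           # assumed delay instruction
                    recent_io = 0
            try:
                value = yield step                     # relay to the outer scheduler
            except Exception as e:
                exc = e                                # relay failures back into the subtask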

Hi Piet,

I like that finally someone is pointing out how to deal with the *concurrent* part. I have some further notes:

* Greenlet interaction is wanted, since interacting with greenlets is slightly different from generators:

  * They don't get the function arguments at greenlet creation time, but on the first `switch`.

    Generator outer use:

        gn = f(*arg, **kwarg)
        gn.next()

    Greenlet outer use:

        gr = greenlet.greenlet(f)
        gr.switch(*args, **kw)

  * Instead of send/next, they always use switch.

  * `yield` is a function call -> there is need for a lib to manage the local part of greenlet operations in any case (so we should just ensure that the scheduler can handle their way of `yield`, but not actually have support/compat code in the stdlib for their yielding).

* Considering regular classes for interaction, since for some protocol implementations different means might make sense (this could also be used for the scheduler part of greenlet interaction). Result -> a protocol for cooperative concurrency.

* Considering the upcoming pypy transaction module/stm, since using that right could mean "free" parallelism in the future.

* Alternatives for queues/channels are needed.

* Pools/rate-limiters and other exercises are needed as well.

* Some kind of default tools for servers are needed.

* The stdlib could have a very simple default scheduler that's just doing something basic, like: run all the work it can, and if it can't, block on an io reactor. We just need something that can run() after all has been created; having an api like scheduler.add(gen) would be a plus (since it would be just like pypy's transaction module). An example I have in mind is something like:

      scheduler.add(...)
      scheduler.add(...)
      scheduler.run()

If things go as I planned on my side, starting in jan/feb 2013 I'll try a prototype implementation for further comments/actual experimentation.

-- Ronny
