[Python-ideas] Proposal: A simple protocol for generator tasks

Wed Oct 17 00:56:44 CEST 2012

On Mon, Oct 15, 2012 at 12:48 PM, Calvin Spealman <ironfroggy at gmail.com> wrote:
>
> What is the difference between the tossed around "yield from task()"
> and this "yield tasklib.spawn(task())"

"yield from task()" is simply the coroutine / task version of a function
call: it runs the task to completion, and returns its final result.

"yield tasklib.spawn(task())" (or however it ends up being spelled)
would be a scheduler primitive to start a task *without* waiting for its
result: in other words, it's a request that the scheduler start a new,
independent thread of control.
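
To make the distinction concrete, here is a minimal sketch of a toy
round-robin scheduler. The names Spawn and run() are hypothetical
stand-ins for "tasklib.spawn" and the scheduler, not part of any real
library:

```python
from collections import deque

class Spawn:
    """Hypothetical scheduler instruction: start `task` independently."""
    def __init__(self, task):
        self.task = task

def run(main_task):
    """Toy round-robin scheduler that understands only None and Spawn."""
    ready = deque([main_task])
    while ready:
        task = ready.popleft()
        try:
            instruction = next(task)
        except StopIteration:
            continue  # task finished: drop it
        if isinstance(instruction, Spawn):
            ready.append(instruction.task)  # new, independent thread of control
        ready.append(task)  # the yielding task is resumed later, without waiting

log = []

def child(name):
    log.append(name + " start")
    yield  # give other tasks a turn
    log.append(name + " end")
    return name.upper()

def main():
    result = yield from child("a")  # like a call: runs to completion
    log.append("got " + result)
    yield Spawn(child("b"))         # starts b, but does not wait for it
    log.append("main done")

run(main())
```

In this sketch, main() only sees the result of child("a") after it has
finished, while child("b") is still running when main() completes.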

> And, why isn't it simply spelled "yield task()"? You have all these different
> types that can be yielded to the scheduler from tasks to the scheduler. Why
> isn't a task one of those possible types? If the scheduler gets an iterator, it
> should schedule it automatically.

This is a good question: I stopped short of discussing it in the
original message only to keep it short, and in the hope that the answer
is implied.

The short answer is that "yield task()" is the old, hacky, cumbersome,
"legacy"[1] way of calling subtasks, and that "yield from" should
entirely remove the need to support it.

Before "yield from", "yield task()" was the only way to call subtasks,
but this approach has some major disadvantages:

1. In order for it to work, schedulers must manually implement task
   trampolining, which is ugly at best, and prone to bugs if not all
   edge cases are handled correctly. (IOW, it effectively places the
   burden of implementing PEP 380 onto each scheduler.)

2. It obfuscates exception tracebacks by default, requiring schedulers
   that want readable stack traces to take additional pains to clean up
   their own non-task frames, while propagating exceptions.

3. It requires schedulers to reliably distinguish between tasks and
   other primitives in the first place.

   Simply treating all iterators as tasks is not sufficient: to run a
   task, you need send() and throw(), at least. (Type-checking for
   GeneratorType would be marginally better, but would unnecessarily
   preclude for example implementing tasks as classes or C extension
   types, which is otherwise entirely possible with this protocol.)
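
As a rough illustration of points 1 and 3, this is approximately what
a scheduler has to do by hand to support "yield task()". The function
name run_trampolined() is hypothetical, and the sketch assumes the
GeneratorType check discussed above. For brevity, subtasks use
"return value", which itself needs 3.3; older code had to fake return
values in other ways:

```python
import types

def run_trampolined(task):
    """Toy driver for the old "yield subtask()" style of calling subtasks."""
    stack = [task]              # explicit call stack of suspended tasks
    value, error = None, None
    while stack:
        current = stack[-1]
        try:
            if error is not None:
                pending, error = error, None
                yielded = current.throw(pending)  # re-raise inside the caller
            else:
                yielded = current.send(value)
        except StopIteration as stop:
            stack.pop()
            value = stop.value  # hand the result to the calling task
            continue
        except Exception as exc:
            stack.pop()
            if not stack:
                raise           # nobody left to handle it
            error = exc         # propagate to the calling task
            continue
        value = None
        if isinstance(yielded, types.GeneratorType):
            stack.append(yielded)  # "call" the subtask
        # any other yielded value is ignored by this toy driver
    return value

def double(x):
    yield  # stands in for some scheduler primitive
    return x * 2

def parent():
    a = yield double(1)
    b = yield double(10)
    return a + b

def failing():
    yield
    raise ValueError("boom")

def catcher():
    try:
        yield failing()
    except ValueError:
        return "caught"
```

Even this small version has to juggle a stack, return values, and
exception propagation, and it still does nothing about tracebacks.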

"yield from" simplifies and solves all these problems in elegant swoop:

1. No more manual trampolining: a scheduler can treat any task as a
   single unit, and only needs to worry about the single, combined
   stream of instructions coming from it.

2. Tracebacks (and return values) take care of themselves, as they
   should.

3. By separating the concerns of direct scheduler communication
   ("yield") and subtask delegation ("yield from"), schedulers can limit
   themselves to just knowing about scheduler primitives when dealing
   with yielded values, which should be more easily and tightly defined than
   the full spectrum of tasks in general. (The set of officially-defined
   scheduler instructions could end up being as small as None and
   Future, say.)
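
A small sketch of points 1 and 2, using made-up string instructions
and hypothetical task names: whoever drives the outermost task sees one
flat stream of instructions, and an exception raised in a subtask
carries the frames of both tasks in its traceback without any help
from the scheduler.

```python
import traceback

def subtask():
    yield "read"
    yield "write"
    return 42

def task():
    yield "sleep"
    result = yield from subtask()
    yield ("done", result)

# The driver only ever sees a single, combined stream of instructions.
stream = list(task())

def broken_subtask():
    yield "read"
    raise RuntimeError("boom")

def outer_task():
    yield from broken_subtask()

gen = outer_task()
next(gen)
try:
    next(gen)
except RuntimeError as exc:
    # Function names of the frames recorded in the traceback, outermost first.
    frames = [frame.name for frame in traceback.extract_tb(exc.__traceback__)]
```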

In summary, it's entirely possible for schedulers to continue supporting
the old "yield task()" way of calling subtasks (and this has no problem
fitting into the proposed protocol[2]), but there should be no reason to
do so, and several good reasons not to: hopefully, it will become a
pre-3.3 historical footnote.

[1] For the purposes of this email, interpret "legacy" to mean "older
    than 17 days". :)

[2] Interpreted as a scheduler instruction, a task value would simply
    mean "resume the current task with the result of completing the
    yielded subtask" (modulo the practical question of reliably
    type-checking tasks, as mentioned).

>> Raising TypeError or NotImplementedError back into the task is probably
>> a reasonable action, and would allow code like:
>>
>>     def task():
>>         try:
>>             yield fancy_magic_instruction()
>>         except NotImplementedError:
>>             yield from boring_fallback()
>>         ...
>
> Interesting. Can anyone think of an example of this?

I just want to note for the record that I'm not *encouraging* this kind
of thing: I'm just observing that it would be allowed by the
protocol.

(However, one imaginable use case would be for tasks to send
scheduler-specific hints, that can safely be ignored when those tasks
are running on other scheduler implementations.)
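
A minimal sketch of how that could look from the scheduler's side,
assuming string instructions and a hypothetical single-task driver
run() that raises NotImplementedError back into the task for anything
it does not recognize:

```python
def run(task, known=()):
    """Toy driver: records the instructions it handles, rejects the rest."""
    handled = []
    error = None
    while True:
        try:
            if error is not None:
                pending, error = error, None
                instruction = task.throw(pending)  # raise back into the task
            else:
                instruction = task.send(None)
        except StopIteration:
            return handled
        if instruction is None or instruction in known:
            handled.append(instruction)
        else:
            error = NotImplementedError(instruction)

def boring_fallback():
    yield "sleep"

def task():
    try:
        yield "fancy_hint"  # scheduler-specific hint (made up)
    except NotImplementedError:
        yield from boring_fallback()
    yield "sleep"

plain = run(task(), known={"sleep"})
fancy = run(task(), known={"sleep", "fancy_hint"})
```

If the task does not catch the NotImplementedError, it simply
propagates out of run() in this sketch.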

>> This is a plain observation on its own, however, it raises one or two
>> interesting possibilities for more interesting schedulers implemented as
>> generator tasks themselves, including:
>>
>> - Specialized sub-schedulers that run as a normal task within their
>>   parent scheduler, but implement for example weighted or priority
>>   queuing of their subtasks, or similar features.
>
> I think that is too messy, you could have so many different scheduler
> semantics. Maybe this sort of thing is what your schedule-specific
> instructions should be for.

It shouldn't get messy: the core semantics of any scheduler should
always stay within the proposed protocol.

The above is not the best example of a custom scheduler, though.
Perhaps a better example would be a generic helper function like the
following, which implements throttling of I/O requests made through it:

    def task():
        result = yield from io_throttled(subtask(), rate=foo)

io_throttled() would end up sitting between task() and subtask() in the
hierarchy, like so:

    ... -> task() -> io_throttled() -> subtask() -> ...

To recap, each task is implicitly driven by the scheduler above it, and
implicitly drives the task(s) below it: The outer scheduler drives
task(), which drives io_throttled(), which drives subtask(), and so on.

In this picture, "yield from" is the "most default" scheduler: it simply
delegates all yielded instructions to the outer scheduler.

However, instead of relying on "yield from", io_throttled() can dip down
into the task protocol itself, and drive subtask() directly. This would
allow it to inspect and manipulate the underlying instructions and
responses flowing back and forth, and, assuming that
there's a recognizable standard representation for I/O primitives, it
could keep track of the rate of I/O, and insert delay instructions as
necessary (or something similar).
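
A rough sketch of what such a helper might look like. Everything here
is an assumption for illustration: the IO and Delay instruction
classes, the toy run() driver, and the simplification of taking a fixed
"delay" between I/O instructions instead of tracking an actual rate:

```python
class IO:
    """Hypothetical standard representation of an I/O instruction."""
    def __init__(self, op):
        self.op = op

class Delay:
    """Hypothetical standard representation of a delay instruction."""
    def __init__(self, seconds):
        self.seconds = seconds

def io_throttled(subtask, delay):
    """Drive `subtask` by hand, yielding a Delay before each I/O after the first."""
    seen_io = False
    response, error = None, None
    while True:
        try:
            if error is not None:
                pending, error = error, None
                instruction = subtask.throw(pending)
            else:
                instruction = subtask.send(response)
        except StopIteration as stop:
            return stop.value  # becomes the result of "yield from io_throttled()"
        if isinstance(instruction, IO):
            if seen_io:
                yield Delay(delay)  # inserted instruction (errors here not forwarded)
            seen_io = True
        try:
            response = yield instruction  # pass through to the outer scheduler
        except Exception as exc:
            error = exc  # forward scheduler errors into the subtask

def subtask():
    a = yield IO("GET /a")
    b = yield IO("GET /b")
    return [a, b]

def task():
    result = yield from io_throttled(subtask(), delay=0.5)
    return result

def run(task):
    """Toy outer scheduler: answers I/O with a canned reply, logs what it sees."""
    trace, response = [], None
    while True:
        try:
            instruction = task.send(response)
        except StopIteration as stop:
            return trace, stop.value
        if isinstance(instruction, IO):
            trace.append(instruction.op)
            response = "reply to " + instruction.op
        else:
            trace.append("delay %s" % instruction.seconds)
            response = None

trace, result = run(task())
```

Neither task(), subtask(), nor run() needs to know that throttling is
happening. A fuller version would also handle close() and forward
errors raised while a delay is pending.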

The key observations I want to make:

* io_throttled() is not special: it is just a normal task, as far as the
  tasks above and below it are concerned, and assumes only a
  recognizable representation of the fundamental I/O and delay
  instructions used.

* To the extent that said underlying primitives are scheduler-agnostic,
  io_throttled() can be used or inserted anywhere, without caring how
  the underlying scheduler or event loop handles I/O, or how its global
  API looks. It just acts locally, in terms of the task protocol.

An example where this kind of thing might actually be useful is an
application or library that wishes to throttle, say, certain HTTP
requests: it could simply internally wrap the tasks that make those
requests in io_throttled(), without any special support from the
underlying scheduler.

This is of course not the only way to solve this particular problem, but
it's an example of how thinking about generator tasks and their
schedulers as two sides of the same underlying protocol could be a
powerful abstraction, enabling a compositional approach to combining
implementations of the protocol that might not be obvious or possible
otherwise.

-- 
Piet Delport