[This is a lengthy mail; I apologize in advance!]

Hi,

I've been following this discussion with great interest, and would
like to put forward a suggestion that might simplify some of the
questions that are up in the air.

There are several key points being considered: what exactly
constitutes a "coroutine" or "tasklet", what the precise semantics of
"yield" and "yield from" should be, how the stdlib can support
different event loops and reactors, and how exactly Futures,
Deferreds, and other APIs fit into the whole picture.

This mail is mostly about the first point: I think everyone agrees
roughly what a coroutine-style generator is, but there's enough
variation in how they are used, both historically and presently, that
the concept isn't as precise as it should be. This makes them hard to
think and reason about (failing the "BDFL gets headaches" test), and
makes it harder to define the behavior of all the parts that they
interact with, too.

This is a sketch of an attempt to define what constitutes a
generator-based task or coroutine more rigorously: I think that the
essential behavior can be captured in a small protocol, building on
the generator and iterator protocols. If anyone else thinks this is a
good idea, maybe something like this could work its way into a PEP?

(For the sake of this mail, I will use the term "generator task" or
"task" as a straw man term, but feel free to substitute "coroutine",
or whatever the preferred name ends up being.)


Definition
==========

Very informally: A "generator task" is what you get if you take a
normal Python function and replace its blocking calls with
"yield from" calls to equivalent subtasks.

More formally, a "generator task" is a generator that implements an
incremental, multi-step computation, and is intended to be externally
driven to completion by a runner, or "scheduler", until it delivers a
final result.

This driving process happens as follows (a sketch of a minimal
scheduler follows below):

1. A generator task is iterated by its scheduler to yield a series of
   intermediate "step" values.

2. Each value yielded as a "step" represents a scheduling instruction,
   or primitive, to be interpreted by the task's scheduler.

   This scheduling instruction can be None ("just resume this task
   later"), or a variety of other primitives, such as Futures ("resume
   this task with the result of this Future"); see below for more.

3. The scheduler is responsible for interpreting each "step"
   instruction as appropriate, and sending the instruction's result,
   if any, back to the task using send() or throw().

   A scheduler may run a single task to completion, or may multiplex
   execution between many tasks: generator tasks should assume that
   other tasks may have executed while the task was yielding.

4. The generator task completes by successfully returning (raising
   StopIteration), or by raising an exception. The task's caller
   receives this result.

(For the sake of discussion, I use "the scheduler" to refer to whoever
calls the generator task's next/send/throw methods, and "the task's
caller" to refer to whoever receives the task's final result, but this
is not important to the protocol: a task should not care who drives it
or consumes its result, just like an iterator should not.)
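To make the driving process concrete, here is a minimal sketch of a
scheduler that runs a single task to completion, and understands only
the None instruction. (run_task() and spam() are made-up names for
illustration only; this is a toy, not a proposed implementation.)

    def run_task(task):
        # Drive a single generator task to completion, and return its
        # final result.  Only the None instruction is recognized:
        # anything else is rejected by throwing TypeError back into
        # the task (which it may catch; see below).
        value, exc = None, None
        while True:
            try:
                if exc is not None:
                    step = task.throw(exc)   # deliver an error into the task
                else:
                    step = task.send(value)  # resume with the last result
            except StopIteration as stop:
                return stop.value            # the task's result (PEP 380)
            value, exc = None, None
            if step is not None:
                exc = TypeError('unrecognized instruction: %r' % (step,))

Driving a trivial task would then look like:

    def spam():
        yield        # one "do nothing" step
        return 42    # the task's final result (Python 3.3+)

    print(run_task(spam()))  # 42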
Scheduling instructions / primitives
====================================

(This could probably use a better name.)

The protocol is intentionally agnostic about the implementation of
schedulers, event loops, or reactors: as long as they implement the
same set of scheduling primitives, code should work across them.

There are multiple ways to accomplish this, but one possibility is to
have a set of common, generic instructions in a standard library
module such as "tasklib" (which could also contain things like default
scheduler implementations, helper functions, and so on).

A partial list of possible primitives (the names are all made up, not
serious suggestions):

1. None: The most basic "do nothing" instruction. This just instructs
   the scheduler to resume the yielding task later.

2. Futures: Instruct the scheduler to resume with the future's result.

   Similar types in third-party libraries, such as Deferreds, could
   potentially be implemented natively by a scheduler that supports
   them, using a wait_for_deferred(d) helper task, or using the idea
   of an "adapter" scheduler (see below).

3. Control primitives: spawn, sleep, etc.

   - Spawn a new (independent) task: yield tasklib.spawn(task())
   - Wait for multiple tasks: (x, y) = yield tasklib.par(foo(), bar())
   - Delay execution: yield tasklib.sleep(seconds)
   - etc.

   These could be simple marker objects, leaving it up to the
   underlying scheduler to actually recognize and implement them; some
   could also be implemented in terms of simpler operations (e.g.
   sleep(), in terms of lower-level suspend and resume operations).

4. I/O operations

   This could be anything from low-level "yield fd_readable(sock)"
   style requests to any of the higher-level APIs being discussed
   elsewhere.

   Whatever the exact API ends up being, the scheduler should
   implement these primitives by waiting for the I/O (or condition),
   and resuming the task with the result, if any.

5. Cooperative concurrency primitives, for working with locks,
   condition variables, and so on. (If useful?)

6. Custom, scheduler-specific instructions: Since a generator task can
   potentially yield anything as a scheduler instruction, it's not
   inconceivable for specialized schedulers to support specialized
   instructions. (Code that relies on such special instructions won't
   work on other schedulers, but that would be the point.)

A question open to debate is what a scheduler should do when faced
with an unrecognized scheduling instruction. Raising TypeError or
NotImplementedError back into the task is probably a reasonable
action, and would allow code like:

    def task():
        try:
            yield fancy_magic_instruction()
        except NotImplementedError:
            yield from boring_fallback()
        ...


Generator tasks as schedulers, and vice versa
=============================================

Note that there is a symmetry to the protocol when a generator task
calls another using "yield from":

    def task():
        spam = yield from subtask()

Here, task() is both a generator task, and the effective scheduler for
subtask(): it "implements" subtask()'s scheduling instructions by
delegating them to its own scheduler.

This is a plain observation on its own, but it raises one or two
possibilities for more interesting schedulers implemented as generator
tasks themselves, including:

- Specialized sub-schedulers that run as a normal task within their
  parent scheduler, but implement, for example, weighted or priority
  queuing of their subtasks, or similar features.

- "Adapter" schedulers that intercept special scheduler instructions
  (say, Deferreds or other library-specific objects), and implement
  them using more generic instructions to the underlying scheduler.
  (A rough sketch of this idea follows below.)
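To illustrate the second possibility, here is a rough sketch of how
such an adapter could look, assuming the convention of translating
Deferreds into Futures. (defer_adapter() and deferred_to_future() are
made-up names; the details would depend on the actual scheduler and
libraries involved.)

    from concurrent.futures import Future
    from twisted.internet.defer import Deferred

    def deferred_to_future(d):
        # Wrap a Twisted Deferred as a Future that fires with the
        # Deferred's eventual result or failure.
        f = Future()
        d.addCallbacks(f.set_result,
                       lambda failure: f.set_exception(failure.value))
        return f

    def defer_adapter(task):
        # An "adapter" scheduler, itself written as a generator task:
        # it drives a subtask, passing ordinary instructions through
        # to its own scheduler unchanged, but intercepting Deferred
        # instructions and translating them into Future instructions.
        value, exc = None, None
        while True:
            try:
                if exc is not None:
                    step = task.throw(exc)   # propagate errors inward
                else:
                    step = task.send(value)  # resume the subtask
            except StopIteration as stop:
                return stop.value            # subtask's final result
            value, exc = None, None
            try:
                if isinstance(step, Deferred):
                    step = deferred_to_future(step)  # translate
                value = yield step  # delegate to our own scheduler
            except Exception as e:
                exc = e             # deliver the failure to the subtask

A subtask that yields Deferreds could then run under a scheduler that
only understands Futures, with:

    result = yield from defer_adapter(subtask())

--
Piet Delport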