[Python-ideas] Proposal: A simple protocol for generator tasks

Piet Delport pjdelport at gmail.com
Tue Oct 16 09:27:01 CEST 2012


On Mon, Oct 15, 2012 at 11:17 AM, Greg Ewing
<greg.ewing at canterbury.ac.nz> wrote:
> Piet Delport wrote:
>
>> 2. Each value yielded as a "step" represents a scheduling instruction,
>>    or primitive, to be interpreted by the task's scheduler.
>
>
> I don't think this technique should be used to communicate
> with the scheduler, other than *maybe* for a *very* small
> set of operations that are truly primitive -- and even then
> I'm not convinced.

But this is by necessity how the scheduler is *already* being
communicated with, at least for the de facto scheduler instructions like
None, Future, and the other primitives being discussed.

This concept of an "intermediate object yielded by a task to its
scheduler on each step, instructing it how to schedule" is already
unavoidably fundamental to how these tasks / coroutines work: this
proposal is just an attempt to name that concept, and define it more
clearly.


> To begin with, there are some operations that *can't* rely
> on yielded instructions as the only way of invoking them.
> Spawning a task, for example -- there must be some way for
> non-task code to invoke that, otherwise you wouldn't be able
> to get top-level tasks into the system.

I'm definitely not suggesting that this be the *only* way of invoking
operations, or that all operations should be invoked this way.

Certainly, everything that is possible inside this protocol will also be
possible outside of it by directly calling methods on some global
scheduler, but that requires knowing who and what that global scheduler
is.

It's important to note that a globally identifiable scheduler object
might not even exist: it's entirely reasonable, for example, to
implement this entire protocol in Twisted by writing a deferTask(task)
helper that handles generic scheduler instructions (None, Future-alike,
and things like spawn() and sleep()) by just arranging for the
appropriate Twisted callbacks and resumptions to happen under the hood.

(This is basically how Twisted's deferredGenerator works currently: the
main difference is that a deferTask() implementation would be able to
run any generic coroutine / generator task code that uses this protocol,
without that code having to know about Twisted.)
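
To make this concrete, a minimal deferTask() could be sketched along
these lines (illustrative only: deferTask itself is hypothetical,
error handling is elided, and only None and Deferred instructions are
recognized here):

    from twisted.internet import defer, reactor

    def deferTask(task):
        # Drive a generator task by interpreting the instructions it
        # yields as Twisted callbacks. Returns a Deferred that fires
        # when the task completes.
        done = defer.Deferred()

        def iterate(value=None):
            try:
                instruction = task.send(value)
            except StopIteration as e:
                done.callback(getattr(e, 'value', None))
                return
            if instruction is None:
                # Bare yield: resume on the next reactor iteration.
                reactor.callLater(0, iterate)
            elif isinstance(instruction, defer.Deferred):
                # Future-alike: resume with the result when it fires.
                instruction.addCallback(iterate)
            # ... sleep(), spawn(), and other instructions would be
            # handled here.

        iterate()
        return done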

Regarding getting top-level tasks into the system, this can be done in a
variety of ways, depending on how particular applications are
structured. For example, if the stdlib grows a standardized default
event loop:

    tasklib.DefaultScheduler(tasks).start()

or:

    result = tasklib.run(task())

or with existing frameworks like Twisted:

    deferTask(task()).addCallback(consume)
    deferTasks(othertasks)
    reactor.run()

In other words, only the top level of an application should need to
worry about how the initial scheduler, tasks, and everything else are
started.


> Also, consider the operation of unblocking a task that's
> waiting for some event to occur. Often you will want to
> invoke this using a callback from an event loop, which is
> not a generator and can't yield anything to anywhere.

This can be done with a scheduler primitive that obtains a callable to
resume the current task, like the strawman:

    resume = yield tasklib.get_resume()

after which resume() can be invoked from an event loop callback (or
another thread) to unblock the task.

However the exact API ends up looking, suspending and resuming tasks are
very fundamental operations, and probably the most worth having as
standardized instructions that any scheduler can implement: a variety of
more powerful abstractions can be generically built on top of them.
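
For example, a task waiting on an event-loop callback might look
something like this (a sketch: get_resume() and suspend() are strawman
primitives, and add_listener() stands in for whatever registration
mechanism the event source actually provides):

    def wait_for_event(event_source):
        # Obtain a callable that resumes this task (strawman primitive).
        resume = yield tasklib.get_resume()
        # The event loop will invoke resume() when the event occurs;
        # the callback itself is not a generator and never yields.
        event_source.add_listener(resume)
        # Suspend until resume() is called.
        yield tasklib.suspend()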


> Given that these operations must provide a way of invoking
> them using a plain function call, there is little reason
> to provide a second way using a yielded instruction.

I don't see the former as an argument to avoid supporting the same
operations as standard yielded instructions.

A task can arrange to wait for a Future using plain function calls, or
by yielding it as an instruction (i.e., "result = yield some_future()"):
the ability to do the former should not make the latter any less
desirable.

The advantage of treating certain primitives as yielded scheduler
instructions is that:

- It's generic and scheduler-agnostic: for example, any task can simply
  yield a Future to its scheduler without caring exactly how the
  scheduler arranges for add_done_callback() to resume the task.

- It requires no global coordination: every generator task already has a
  direct line of communication to its immediate scheduler, without
  having to identify itself using handles, task ids, or other
  mechanisms.

In other words, it's the difference between saying:

    h = get_current_task_handle()
    current_scheduler.sleep(h, 10)
    yield
    current_scheduler.suspend(h)
    yield

and saying:

    yield tasklib.sleep(10)
    yield tasklib.suspend()

where sleep(n) and suspend() are simple generic objects that any
scheduler can recognize and implement, just like how yielded None and
Future values are recognized and implemented.
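
To illustrate the scheduler side, recognizing these instructions could
be as simple as a dispatch on the yielded value (a minimal sketch:
tasklib.sleep and tasklib.suspend as instruction types, and the
schedule() and call_later() helpers, are all assumed):

    import concurrent.futures

    def step(task, value=None):
        # Advance the task one step and interpret its instruction.
        try:
            instruction = task.send(value)
        except StopIteration:
            return
        if instruction is None:
            schedule(task)                  # bare yield: run again soon
        elif isinstance(instruction, concurrent.futures.Future):
            # Resume with the future's result, however this scheduler
            # happens to arrange that.
            instruction.add_done_callback(
                lambda f: step(task, f.result()))
        elif isinstance(instruction, tasklib.sleep):
            call_later(instruction.seconds, step, task)
        elif isinstance(instruction, tasklib.suspend):
            pass                            # resumed externally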


> In any case, I believe that the public interface for *any*
> scheduler operation should not be a yielded instruction,
> but either a plain function or something called using
> yield-from, for reasons I explained to Guido earlier.

In other words, limiting the allowable set of yielded scheduler
instructions to None, and doing everything else via a separate API?

This is possible, but it seems like an awful waste of the perfectly good
and dedicated communication channel that already exists between tasks
and their schedulers, in favor of something more complex and indirect.

There's certainly a motivation for global APIs too, as with the
discussion about getting standardized event loops and schedulers into
the stdlib, but I think that is solving a somewhat different problem,
and I see no reason to tie coroutines / generator tasks to those APIs
when a simpler, more generic and universal protocol could be defined.

To me, defining locally how a scheduler should behave and respond to
certain yielded types and values is a much more tractable problem than
the question of designing a good global scheduler API that exposes all
the same operations in a way that's portable and usable across many
different application architectures and lifecycles.


> There are problems with allowing multiple schedulers to
> coexist within the one system, especially if yielded
> instructions are the only way to communicate with them.
>
> It might work for instructions to a task's own scheduler
> concerning itself, but some operations need to operate on
> a *different* task, e.g. unblocking a task when the event
> it was waiting for occurs. How do you know which scheduler
> is managing it?

The point of a protocol like this is that there would be no need for
tasks to know which schedulers are managing what: they can limit
themselves to using a generic protocol.

For example, the par() implementation I gave assumes the primitive:

    resume = yield tasklib.get_resume()

to get a callable to resume itself, and can simply pass that callable to
the tasks it spawns: the last child to complete just calls resume() to
resume the parent task in its own scheduler.
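
In outline, such a par() could look something like the following (a
sketch, not the exact implementation from the earlier message; spawn()
and suspend() are assumed strawman primitives alongside get_resume()):

    def par(*tasks):
        # Run tasks concurrently; resume the parent when the last ends.
        resume = yield tasklib.get_resume()
        results = [None] * len(tasks)
        remaining = len(tasks)

        def child(i, task):
            nonlocal remaining
            # Delegate to the task, passing its yielded instructions
            # through to whichever scheduler runs this child.
            results[i] = yield from task
            remaining -= 1
            if remaining == 0:
                resume()        # last child resumes the parent

        for i, task in enumerate(tasks):
            yield tasklib.spawn(child(i, task))
        yield tasklib.suspend()
        return results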

In this example, the resume callable contains all the necessary state to
resume that particular task. A particular scheduler could implement this
primitive by sending back a closure like:

    lambda: current_scheduler.schedule(the_task)

In the case of something like deferTask(), there need not even be any
particular long-lived scheduler aside from the transient calls arranged
by deferTask, and all the state would live in the Twisted reactor and
its queues:

    lambda: reactor.callLater(0, _defertask_iterate, the_task)

As far as the generic protocol is concerned, it does not matter whether
there's a single global scheduler, or multiple schedulers, or no single
scheduler at all: the scheduler side of the protocol is free to be
implemented in many ways, and manage its state however it's convenient.


> And even if you can find out, if you have to control it using yielded
> instructions, you have no way of yielding something to a different
> task's scheduler.

Generally speaking, this should not be necessary: inter-task
communication is a different question to how tasks should communicate
with their immediate scheduler.

Generically controlling the scheduling of different tasks can be done in
many ways:

- The way par() passes its resume callable to its spawned children.

- Using synchronization primitives: for example, an alternative way to
  implement something like par() without direct use of suspend/resume
  is a cooperative condition variable or semaphore.

- Using queues, channels, or similar mechanisms to communicate
  information between tasks, as sketched below. (The communicated
  values can implicitly even be scheduler instructions themselves,
  like a queue of Futures.)
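
For instance, with a hypothetical scheduler-aware queue (whose put()
and get() are themselves small tasks, delegated to with yield from), a
consumer could treat each received Future as an instruction;
submit_work() and handle() stand in for application code:

    def producer(queue, jobs):
        for job in jobs:
            # submit_work() is assumed to return a Future for the job.
            yield from queue.put(submit_work(job))

    def consumer(queue):
        while True:
            future = yield from queue.get()
            # Yielding the Future is a scheduler instruction: the task
            # resumes with its result once the Future completes.
            result = yield future
            handle(result)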

If something cannot be done inside this generator task protocol, you can
of course still step outside of it and use other mechanisms directly,
but that necessarily ties your code to those mechanisms, which may not
be as simple and universal as code that only relies on this protocol.


