[Python-ideas] New PEP 550: Execution Context

Sat Aug 12 03:54:03 EDT 2017

Hi Yury,

This is really cool. Some notes on a first read:

1. Excellent work on optimizing dict, that seems valuable independent
of the rest of the details here.

2. The text doesn't mention async generators at all. I assume they
also have an agi_isolated_execution_context flag that can be set, to
enable @asyncontextmanager?

2a. Speaking of which I wonder if it's possible for async_generator to
emulate this flag... I don't know if this matters -- at this point the
main reason to use async_generator is for code that wants to support
PyPy. If PyPy gains native async generator support before CPython 3.7
comes out then async_generator may be entirely irrelevant before PEP
550 matters. But right now async_generator is still quite handy...

2b. BTW, the contextmanager trick is quite nice -- I actually noticed
last week that PEP 521 had a problem here, but didn't think of a
solution :-).

3. You're right that numpy is *very* performance sensitive about
accessing the context -- the errstate object is needed extremely
frequently, even on trivial operations like adding two scalars, so a
dict lookup is very noticeable. (Imagine adding a dict lookup to
float.__add__.) Right now, the errstate object get stored in the
threadstate dict, and then there are some dubious-looking hacks
involving a global (not thread-local) counter to let us skip the
lookup entirely if we think that no errstate object has been set.
Really what we ought to be doing (currently, in a non PEP 550 world)
is storing the errstate in a __thread variable -- it'd certainly be
worth it. Adopting PEP 550 would definitely be easier if we knew that
it wasn't ruling out that level of optimization.

4. I'm worried that all of your examples use string keys. One of the
great things about threading.local objects is that each one is a new
namespace, which is a honking great idea -- here it prevents
accidental collisions between unrelated libraries. And while it's
possible to implement threading.local in terms of the threadstate dict
(that's how they work now!), it requires some extremely finicky code
to get the memory management right:

https://github.com/python/cpython/blob/dadca480c5b7c5cf425d423316cd695bc5db3023/Modules/_threadmodule.c#L558-L595

It seems like you're imagining that this API will be used directly by
user code? Is that true? ...Are you sure that's a good idea? Are we
just assuming that not many keys will be used and the keys will
generally be immortal anyway, so leaking entries is OK? Maybe this is
nit-picking, but this is hooking into the language semantics in such a
deep way that I sorta feel like it would be bad to end up with
something where we can never get garbage collection right.

The suggested index-based API for super fast C lookup also has this
problem, but that would be such a low-level API -- and not part of the
language definition -- that the right answer is probably just to
document that there's no way to unallocate indices so any given C
library should only allocate, like... 1 of them. Maybe provide an
explicit API to release an index, if we really want to get fancy.

5. Is there some performance-related reason that the API for
getting/setting isn't just sys.get_execution_context()[...] = ...? Or
even sys.execution_context[...]?

5a. Speaking of which I'm not a big fan of the None-means-delete
behavior. Not only does Python have a nice standard way to describe
all the mapping operations without such hacks, but you're actually
implementing that whole interface anyway. Why not use it?

6. Should Thread.start inherit the execution context from the spawning thread?

7. Compatibility: it does sort of break 3rd party contextmanager
implementations (contextlib2, asyncio_extras's acontextmanager, trio's
internal acontextmanager, ...). This is extremely minor though.

8. You discuss how this works for asyncio and gevent. Have you looked
at how it will interact with tornado's context handling system? Can
they use this? It's the most important extant context implementation I
can think of (aside from thread local storage itself).

9. OK, my big question, about semantics.

The PEP's design is based on the assumption that all context-local
state is scalar-like, and contexts split but never join. But there are
some cases where this isn't true, in particular for values that have
"stack-like" semantics. These are terms I just made up, but let me
give some examples. Python's sys.exc_info is one. Another I ran into
recently is for trio's cancel scopes.

So basically the background is, in trio you can wrap a context manager
around any arbitrary chunk of code and then set a timeout or
explicitly cancel that code. It's called a "cancel scope". These are
fully nestable. Full details here:
https://trio.readthedocs.io/en/latest/reference-core.html#cancellation-and-timeouts

Currently, the implementation involves keeping a stack of cancel
scopes in Task-local storage. This works fine for regular async code
because when we switch Tasks, we also switch the cancel scope stack.
But of course it falls apart for generators/async generators:

async def agen():
    with fail_after(10):  # 10 second timeout for finishing this block
        await some_blocking_operation()
        yield
        await another_blocking_operation()

async def caller():
    with fail_after(20):
        ag = agen()
        await ag.__anext__()
        # now that cancel scope is on the stack, even though we're not
        # inside the context manager! this will not end well.
        await some_blocking_operation()  # this might get cancelled
when it shouldn't
    # even if it doesn't, we'll crash here when exiting the context manager
    # because we try to pop a cancel scope that isn't at the top of the stack

So I was thinking about whether I could implement this using PEP 550.
It requires some cleverness, but I could switch to representing the
stack as a singly-linked list, and then snapshot it and pass it back
to the coroutine runner every time I yield. That would fix the case
above. But, I think there's another case that's kind of a showstopper.

async def agen():
    await some_blocking_operation()
    yield

async def caller():
    ag = agen()  # context is captured here
    with fail_after(10):
        await ag.__anext__()

Currently this case works correctly: the timeout is applied to the
__anext__ call, as you'd expect. But with PEP 550, it wouldn't work:
the generator's timeouts would all be fixed when it was instantiated,
and we wouldn't be able to detect that the second call has a timeout
imposed on it. So that's a pretty nasty footgun. Any time you have
code that's supposed to have a timeout applied, but in fact has no
timeout applied, then that's a really serious bug -- it can lead to
hangs, trivial DoS, pagers going off, etc.

Another problem is code like:

async def caller():
    with fail_after(10):
        ag = agen()
   # then exit the scope

Can we clean up the cancel scope? (e.g., remove it from the global
priority queue that tracks timeouts?) Normally yes, that's what
__exit__ blocks are for, letting you know deterministically that an
object can be cleaned up. But here it got captured by the async
generator. I really don't want to have to rely on the GC, because on
PyPy it means that we could leak an unbounded number of cancel scopes
for a finite but unbounded number of time, and all those extra entries
in the global timeout priority queue aren't free.

(And sys.exc_info has had buggy behavior in analogous situations.)

So, I'm wondering if you (or anyone) have any ideas how to fix this
:-). Technically, PEP 521 is powerful enough to do it, but in practice
the performance would be catastrophically bad. It's one thing to have
some extra cost to yielding out of an np.errstate block, those are
rare and yielding out of them is rare. But cancel scopes are
different: essentially all code in trio runs inside one or more of
them, so every coroutine suspend/resume would have to call all those
suspend/resume hooks up and down the stack. OTOH PEP 550 is fast, but
AFAICT its semantics are wrong for this use case.

The basic invariant I want is: if at any given moment you stop and
take a backtrace, and then look at the syntactic surroundings of each
line in the backtrace and write down a list of all the 'with' blocks
that the code *looks* like it's inside, then context lookups should
give the same result as they would if you simply entered all of those
with blocks in order. Generators make it tricky to maintain this
invariant, because a generator frame's backtrace changes every time
you call next(). But those are the semantics that make the most sense
to me, and seem least surprising in practice. These are also IIUC the
semantics that exc_info is supposed to follow (though historically the
interaction of exc_info and generators has had lots of bugs, not sure
if that's been fixed or not).

...and now that I've written that down, I sort of feel like that might
be what you want for all the other sorts of context object too? Like,
here's a convoluted example:

def gen():
    a = decimal.Decimal("1.111")
    b = decimal.Decimal("2.222")
    print(a + b)
    yield
    print(a + b)

def caller():
    # let's pretend this context manager exists, the actual API is
more complicated
    with decimal_context_precision(3):
        g = gen()
    with decimal_context_precision(2):
        next(g)
    with decimal_context_precision(1):
        next(g)

Currently, this will print "3.3 3", because when the generator is
resumed it inherits the context of the resuming site. With PEP 550, it
would print "3.33 3.33" (or maybe "3.3 3.3"? it's not totally clear
from the text), because it inherits the context when the generator is
created and then ignores the calling context. It's hard to get strong
intuitions, but I feel like the current behavior is actually more
sensible -- each time the generator gets resumed, the next bit of code
runs in the context of whoever called next(), and the generator is
just passively inheriting context, so ... that makes sense.

OTOH of course if you change the generator code to:

def gen():
    a = decimal.Decimal("1.111")
    b = decimal.Decimal("2.222")
    with decimal_context_precision(4):
        print(a + b)
        yield
        print(a + b)

then it should print "3.333 3.333", because the generator is
overriding the caller -- now when we resume the frame we're
re-entering the decimal_context_precision(4) block, so it should take
priority.

So ... maybe all context variables are "stack-like"?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org