[Python-Dev] Slides from today's parallel/async Python talk

Trent Nelson trent at snakebite.org
Thu Mar 14 22:30:14 CET 2013

    Cross-referenced to relevant bits of code where appropriate.

    (And just a quick reminder regarding the code quality disclaimer:
    I've been hacking away on this stuff relentlessly for a few months;
    the aim has been to make continual forward progress without getting
    bogged down in non-value-add busy work.  Lots of wildly inconsistent
    naming conventions and dead code that'll be cleaned up down the
    track.  And the relevance of any given struct will tend to be
    proportional to how many unused members it has (homeless hoarder +
    shopping cart analogy).)

On Thu, Mar 14, 2013 at 11:45:20AM -0700, Trent Nelson wrote:
> The basic premise is that parallel 'Context' objects (well, structs)
> are allocated for each parallel thread callback.

    The 'Context' struct:


    Allocated via new_context():


    ....also relevant, new_context_for_socket() (encapsulates a
    client/server instance within a context).


    Primary role of the context is to isolate the memory management.
    This is achieved via 'Heap':


    (Which I sort of half started refactoring to use the _HEAD_EXTRA
     approach when I thought I'd need to have a separate heap type for
     some TLS avenue I explored -- turns out that wasn't necessary).

> The context persists for the lifetime of the "parallel work".
> The "lifetime of the parallel work" depends on what you're doing.  For
> a simple ``async.submit_work(foo)``, the context is considered
> complete once ``foo()`` has been called (presuming no exceptions were
> raised).

    Managing context lifetime is one of the main responsibilities of


> For an async client/server, the context will persist for the entirety
> of the connection.

    Marking a socket context as 'finished' for servers is the job of


> The context is responsible for encapsulating all resources related to
> the parallel thread.  So, it has its own heap, and all memory
> allocations are taken from that heap.

    The heap is initialized in two steps during new_context().  First,
    a handle is allocated for the underlying system heap (via


    The first "heap" is then initialized for use with our context via
    the Heap_Init(Context *c, size_t n, int page_size) call:


    Heaps are actually linked together via a doubly-linked list.  The
    first heap is a value member (not a pointer) of Context; however,
    the active heap is always accessed via the '*h' pointer which is
    updated as necessary.

        struct Heap {
            Heap *prev;
            Heap *next;
            void *base;
            void *next;
            int   allocated;
            int   remaining;

        struct Context {
            Heap  heap;
            Heap *h;

> For any given parallel thread, only one context can be executing at a
> time, and this can be accessed via the ``__declspec(thread) Context
> *ctx`` global (which is primed by some glue code as soon as the
> parallel thread starts executing a callback).

    Glue entry point for all callbacks is _PyParallel_EnteredCallback:


    On the topic of callbacks, the main workhorse for the
    submit_(wait|work) callbacks is _PyParallel_WorkCallback:


    The interesting logic starts at start:


    The interesting part is the error handling.  If the callback raises
    an exception, we check to see if an errback has been provided.  If
    so, we call the errback with the error details.

    If the callback completes successfully (or it fails, but the errback
    completes successfully), that is treated as successful callback or
    errback completion, respectively:



    If the errback fails, or no errback was provided, the exception
    percolates back to the main thread.  This is handled at error:


    This should make the behavior of async.run_once() clearer.  The
    first thing it does is check to see if any errors have been posted.


    Errors are returned back to calling code on a first-error-wins
    basis.  (This involves fiddling with the context's lifetime, as
    we're essentially propagating an object created in a parallel
    context (the (exception, value, traceback) tuple) back to a main
    thread context -- so, we can't blow away that context until the
    exception has had a chance to properly bubble back up and be dealt

    If there are no errors, we then check to see if any "call from main
    thread" requests have been made:


    I added support for this in order to ease unit testing, but it has
    general usefulness.  It's exposed via two decorators:

    def foo(arg):

    def callback():


    That creates a parallel thread, invokes callback(), which then
    results in foo(arg) eventually being called from the main thread.
    This would be useful for synchronising access to a database or
    something like that.

    There's also @async.call_from_main_thread_and_wait, which I probably
    should have mentioned first:

    def update_login_details(login, details)
        db.update(login, details)

    def foo():
        update_login_details(x, y)
        # execution will resume when the main thread finishes
        # update_login_details()


    Once all "main thread work requests" have been processed, completed
    callbacks and errbacks are processed.  This basically just involves
    transitioning the associated context onto the "path to freedom" (the
    lifecycle that eventually results in the context being free()'d and
    the heap being destroyed).


> No reference counting or garbage collection is done during parallel
> thread execution.  Instead, once the context is finished, it is
> scheduled to be released, which means it'll be "processed" by the main
> thread as part of its housekeeping work (during ``async.run()``
> (technically, ``async.run_once()``).
> The main thread simply destroys the entire heap in one fell swoop,
> releasing all memory that was associated with that context.

    The "path to freedom" lifecycle is a bit complicated at the moment
    and could definitely use a review.  But, basically, the main methods
    are _PxState_PurgeContexts() and _PxState_FreeContext(); the former
    checks that the context is ready to be freed, the latter does the
    actual freeing.





    The reason for the separation is to maintain bubbling effect; a
    context only makes one transition per run_once() invocation.
    Putting this in place was a key step to stop wild crashes in the
    early days when unittest would keep hold of exceptions longer than I
    was expecting -- it should probably be reviewed in light of the new
    persistence support I implemented (much later).

> There are a few side effects to this.  First, the heap allocator
> (basically, the thing that answers ``malloc()`` calls) is incredibly
> simple.  It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB
> on x64), and simply returns pointers to that chunk for each memory
> request (adjusting h->next and allocation stats as it goes along,
> obviously).  Once the 2MB has been exhausted, another 2MB is
> allocated.

    _PyHeap_Malloc is the workhorse here:


    Very simple, just keeps nudging along the h->next pointer for each
    request, allocating another heap when necessary.  Nice side effect
    is that it's ridiculously fast and very cache friendly.  Python code
    running within parallel contexts runs faster than normal main-thread
    code because of this (plus the boost from not doing any ref

    The simplicity of this approach made the heap snapshot logic really
    simple to implement too; taking a snapshot and then rolling back is
    just a couple of memcpy's and some pointer fiddling.

> That approach is fine for the ``submit_(work|timer|wait)`` callbacks,
> which basically provide a way to run a presumably-finite-length
> function in a parallel thread (and invoking callbacks/errbacks as
> required).
> However, it breaks down when dealing with client/server stuff.  Each
> invocation of a callback (say, ``data_received(...)``) may only
> consume, say, 500 bytes, but it might be called a million times before
> the connection is terminated.  You can't have cumulative memory usage
> with possibly-infinite-length client/server-callbacks like you can
> with the once-off ``submit_(work|wait|timer)`` stuff.
> So, enter heap snapshots.  The logic that handles all client/server
> connections is instrumented such that it takes a snapshot of the heap
> (and all associated stats) prior to invoking a Python method (via
> ``PyObject_Call()``, for example, i.e. the invocation of
> ``data_received``).

    I came up with the heap snapshot stuff in a really perverse way.
    The first cut introduced a new 'TLS heap' concept; the idea was that
    before you'd call PyObject_CallObject(), you'd enable the TLS heap,
    then roll it back when you were done.  i.e. the socket IO loop code
    had a lot of stuff like this:

    snapshot = ENABLE_TLS_HEAP();
    if (!PyObject_CallObject(...)) {

    /* do stuff */

    That was fine initially, until I had to deal with the (pretty
    common) case of allocating memory from the TLS heap (say, for an
    async send), and then having the callback picked up by a different
    thread.  That thread then had to return the other thread's snapshot
    and, well, it just fell apart conceptually.

    Then it dawned on me to just add the snapshot/rollback stuff to
    normal Context objects.  In retrospect, it's silly I didn't think of
    this in the first place -- the biggest advantage of the Context
    abstraction is that it's thread-local, but not bindingly so (as in,
    it'll only ever run on one thread at a time, but it doesn't matter
    which one, which is essential, because the ).

    Once I switched out all the TLS heap cruft for Context-specific heap
    snapshots, everything "Just Worked".  (I haven't removed the TLS
    heap stuff yet as I'm still using it elsewhere (where it doesn't
    have the issue above).  It's an xxx todo.)

    The main consumer of this heap snapshot stuff (at the moment) is the
    socket IO loop logic:


    Typical usage now looks like this:

       snapshot = PxContext_HeapSnapshot(c, NULL);
       if (!PxSocket_LoadInitialBytes(s)) {
           PxContext_RollbackHeap(c, &snapshot);

       /* at some later point... */
       PxContext_RollbackHeap(c, &snapshot);

> When the method completes, we can simply roll back the snapshot.  The
> heap's stats and next pointers et al all get reset back to what they
> were before the callback was invoked.
> The only issue with this approach is detecting when the callback has
> done the unthinkable (from a shared-nothing perspective) and persisted
> some random object it created outside of the parallel context it was
> created in.
> That's actually a huge separate technical issue to tackle -- and it
> applies just as much to the normal ``submit_(wait|work|timer)``
> callbacks as well.  I've got a somewhat-temporary solution in place
> for that currently:
> That'll result in two contexts being created, one for each callback
> invocation.  ``async.dict()`` is a "parallel safe" wrapper around a
> normal PyDict.  This is referred to as "protection".
> In fact, the code above could have been written as follows:
>     d = async.protect(dict())
> What ``protect()`` does is instrument the object such that we
> intercept ``__getitem__``, ``__setitem__``, ``__getattr__`` and
> ``__setattr__``.

    The 'protect' details are pretty hairy.  _protect does a few checks:


    ....and then palms things off to _PyObject_PrepOrigType:


    That method is where the magic happens.  We basically clone the type
    object for the object we're protecting, then replace the setitem,
    getitem etc methods with our counterparts (described next):


    Note the voodoo involved in 'protecting' heap objects versus normal
    C-type objects, GC objects versus non-GC, etc.

> We replace these methods with counterparts that
> serve two purposes:
>  1. The read-only methods are wrapped in a read-lock, the write
>  methods are wrapped in a write lock (using underlying system slim
>  read/write locks, which are uber fast).  (Basically, you can have
>  unlimited readers holding the read lock, but only one writer can hold
>  the write lock (excluding all the readers and other writers).)
>  2. Detecting when parallel objects (objects created from within a
>  parallel thread, and thus, backed by the parallel context's heap)
>  have been assigned outside the context (in this case, to a
>  "protected" dict object that was created from the main thread).

    This is handled via _Px_objobjargproc_ass:


    That is responsible for detecting when a parallel object is being
    assigned to a non-parallel object (and tries to persist the object
    where necessary).

> The first point is important as it ensures concurrent access doesn't
> corrupt the data structure.
> The second point is important because it allows us to prevent the
> persisted object's context from automatically transitioning into the
> complete->release->heapdestroy lifecycle when the callback completes.
> This is known as "persistence", as in, a context has been persisted.
> All sorts of things happen to the object when we detect that it's been
> persisted.  The biggest thing is that reference counting is enabled
> again for the object (from the perspective of the main thread; ref
> counting is still a no-op within the parallel thread) -- however, once
> the refcount hits zero, instead of free()ing the memory like we'd
> normally do in the main thread (or garbage collecting it), we decref
> the reference count of the owning context.

    That's the job of _Px_TryPersist (called via _Px_objobjargproc_ass
    as mentioned above):


    That makes use of yet-another-incredibly-useful-Windows-feature
    called 'init once'; basically, underlying system support for
    ensuring something only gets done *once*.  Perfect for avoiding race

> Once the owning context's refcount goes to zero, we know that no more
> references exist to objects created from that parallel thread's
> execution, and we're free to release the context (and thus, destroy
> the heap -> free the memory).

    All that magic is the unfortunate reason my lovely Py_INCREF/DECREF
    overrides when from very simple to quite-a-bit-more-involved.  i.e.
    originally Py_INCREF was just:

#define Py_INCREF(o) (Py_PXCTX ? (void)0; Py_REFCNT(o)++);

    With the advent of parallel object persistence and context-specific
    refcounts, things become less simple:

   890 __inline
   891 void
   892 _Py_IncRef(PyObject *op)
   893 {
   894     if ((!Py_PXCTX && (Py_ISPY(op) || Px_PERSISTED(op)))) {
   895         _Py_INC_REFTOTAL;
   896         (((PyObject*)(op))->ob_refcnt++);
   897     }
   898 }

   909 __inline
   910 void
   911 _Py_DecRef(PyObject *op)
   912 {
   913     if (!Py_PXCTX) {
   914         if (Px_PERSISTED(op))
   915             Px_DECREF(op);
   916         else if (!Px_ISPX(op)) {
   917             _Py_DEC_REFTOTAL;
   918             if ((--((PyObject *)(op))->ob_refcnt) != 0) {
   919                 _Py_CHECK_REFCNT(op);
   920             } else
   921                 _Py_Dealloc((PyObject *)(op));
   922         }
   923     }
   924 }

> That's currently implemented and works very well.  There are a few
> drawbacks: one, the user must only assign to an "async protected"
> object.  Use a normal dict and you're going to segfault or corrupt
> things (or worse) pretty quickly.
> Second, we're persisting the entire context potentially for a single
> object.  The context may be huge; think of some data processing
> callback that ran for ages, racked up a 100MB footprint, but only
> generated a PyLong with the value 42 at the end, which consumes, like,
> 50 bytes (or whatever the size of a PyLong is these days).
> It's crazy keeping a 100MB context around indefinitely until that
> PyLong object goes away, so, we need another option.  The idea I have
> for that is "promotion".  Rather than persist the context, the object
> is "promoted"; basically, the parallel thread palms it off to the main
> thread, which proceeds to deep-copy the object, and take over
> ownership.  This removes the need for the context to be persisted.
> Now, I probably shouldn't have said "deep-copy" there.  Promotion is a
> terrible option for anything other than simple objects (scalars).  If
> you've got a huge list that consumes 98% of your 100MB heap footprint,
> well, persistence is perfect.  If it's a 50 byte scalar, promotion is
> perfect.  (Also, deep-copy implies collection interrogation, which has
> all sorts of complexities, so, err, I'll probably end up supporting
> promotion if the object is a scalar that can be shallow-copied.  Any
> form of collection or non-scalar type will get persisted by default.)
> I haven't implemented promotion yet (persistence works well enough for
> now).  And none of this is integrated into the heap snapshot/rollback
> logic -- i.e. we don't detect if a client/server callback assigned an
> object created in the parallel context to a main-thread object -- we
> just roll back blindly as soon as the callback completes.
> Before this ever has a chance of being eligible for adoption into
> CPython, those problems will need to be addressed.  As much as I'd
> like to ignore those corner cases that violate the shared-nothing
> approach -- it's inevitable someone, somewhere, will be assigning
> parallel objects outside of the context, maybe for good reason, maybe
> by accident, maybe because they don't know any better.  Whatever the
> reason, the result shouldn't be corruption.
> So, the remaining challenge is preventing the use case alluded to
> earlier where someone tries to modify an object that hasn't been
> "async protected".  That's a bit harder.  The idea I've got in mind is
> to instrument the main CPython ceval loop, such that we do these
> checks as part of opcode processing.  That allows us to keep all the
> logic in the one spot and not have to go hacking the internals of
> every single object's C backend to ensure correctness.
> Now, that'll probably work to an extent.  I mean, after all, there are
> opcodes for all the things we'd be interested in instrumenting,
> LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc.  What becomes challenging is
> detecting arbitrary mutations via object calls, i.e. how do we know,
> during the ceval loop, that foo.append(x) needs to be treated
> specially if foo is a main-thread object and x is a parallel thread
> object?
> There may be no way to handle that *other* than hacking the internals
> of each object, unfortunately.  So, the viability of this whole
> approach may rest on whether or that's deemed as an acceptable
> tradeoff (a necessary evil, even) to the Python developer community.

    Actually, I'd sort of forgotten that I started adding protection
    support for lists in _PyObject_PrepOrigType.  Well, technically,
    support for intercepting PySequenceMethods:


    I settled for just intercepting PyMappingMethods initially, which
    is why that chunk of code is commented out.  Intercepting the
    mapping methods allowed me to implement the async protection for
    dicts and generic objects, which was sufficient for testing
    purposes at the time.

    So, er, I guess my point is that automatically detecting object
    mutation might not be as hard as I'm alluding to above.  I'll be
    happy if we're able to simply raise an exception if you attempt to
    mutate a non-protected main-thread object.  That's infinitely better
    than segfaulting or silent corruption.


More information about the Python-Dev mailing list