[Python-Dev] Slides from today's parallel/async Python talk

Trent Nelson trent at snakebite.org
Thu Mar 14 19:45:20 CET 2013


On Wed, Mar 13, 2013 at 07:05:41PM -0700, Trent Nelson wrote:
>     Just posted the slides for those that didn't have the benefit of
>     attending the language summit today:
> 
>         https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-alternate-approach-to-async

    Someone on /r/python asked if I could elaborate on the "do Y" part
    of "if we're in a parallel thread, do Y, if not, do X", which I
    (inadvertently) ended up replying to in detail.  I've included the
    response below.  (I'll work on converting this into a TL;DR set of
    slides soon.)

> Can you go into a bit of depth about "X" here?

That's a huge topic that I'm hoping to tackle ASAP.  The basic premise
is that parallel 'Context' objects (well, structs) are allocated for
each parallel thread callback.  The context persists for the lifetime of
the "parallel work".

The "lifetime of the parallel work" depends on what you're doing.  For a
simple ``async.submit_work(foo)``, the context is considered complete
once ``foo()`` has been called (presuming no exceptions were raised).

For an async client/server, the context will persist for the entirety of
the connection.

The context is responsible for encapsulating all resources related to
the parallel thread.  So, it has its own heap, and all memory
allocations are taken from that heap.

For any given parallel thread, only one context can be executing at a
time, and this can be accessed via the ``__declspec(thread) Context
*ctx`` global (which is primed by some glue code as soon as the parallel
thread starts executing a callback).
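
To give a feel for the moving parts, here's a rough Python sketch of
that relationship.  (The real thing is C structs and raw memory; every
name below is illustrative, not the actual API.)

    import threading

    class Heap:
        # Stand-in for the per-context heap; the real one hands out raw
        # memory in large chunks (see the allocator description below).
        def __init__(self):
            self.blocks = []
            self.next = 0
            self.allocations = 0

    class Context:
        # One of these is allocated per parallel callback; it owns every
        # resource (most importantly, the heap) used while the callback
        # runs.
        def __init__(self, callback):
            self.callback = callback
            self.heap = Heap()
            self.finished = False

    # One "active context" per parallel thread; the moral equivalent of
    # the ``__declspec(thread) Context *ctx`` global.
    _active = threading.local()

    def _run_in_parallel_thread(ctx):
        _active.ctx = ctx       # primed before the callback executes
        try:
            ctx.callback()
        finally:
            # The real code queues the finished context for the main
            # thread to release (see below).
            ctx.finished = True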

No reference counting or garbage collection is done during parallel
thread execution.  Instead, once the context is finished, it is
scheduled to be released, which means it'll be "processed" by the main
thread as part of its housekeeping work during ``async.run()``
(technically, ``async.run_once()``).

The main thread simply destroys the entire heap in one fell swoop,
releasing all memory that was associated with that context.
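
Continuing the sketch above, the main thread's housekeeping step looks
conceptually like this (again, illustrative only):

    # Contexts whose callbacks have completed get queued for the main
    # thread; during ``async.run_once()`` they're torn down wholesale.
    # No per-object refcounting or GC ever happens on the parallel side.
    finished_contexts = []

    def run_once():
        # ... the rest of the event loop's housekeeping ...
        while finished_contexts:
            ctx = finished_contexts.pop()
            destroy_heap(ctx.heap)

    def destroy_heap(heap):
        # One fell swoop: in C this just hands the heap's pages back to
        # the system; here we simply drop every block.
        heap.blocks.clear()
        heap.next = 0
        heap.allocations = 0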

There are a few side effects to this.  First, the heap allocator
(basically, the thing that answers ``malloc()`` calls) is incredibly
simple.  It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB on
x64), and simply returns pointers to that chunk for each memory request
(adjusting h->next and allocation stats as it goes along, obviously).
Once the 2MB has been exhausted, another 2MB is allocated.
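
In Python-ish terms, the allocator amounts to something like this (a
simplified model; the real one hands out raw pointers from C):

    LARGE_PAGE_SIZE = 2 * 1024 * 1024       # 2MB on x64

    class BumpHeap:
        def __init__(self):
            self.blocks = [bytearray(LARGE_PAGE_SIZE)]
            self.next = 0                   # h->next: offset into the block
            self.allocations = 0            # running stats

        def malloc(self, size):
            # (Oversized requests and alignment are ignored for brevity.)
            if self.next + size > LARGE_PAGE_SIZE:
                # Current 2MB chunk exhausted: grab another, start over.
                self.blocks.append(bytearray(LARGE_PAGE_SIZE))
                self.next = 0
            offset = self.next
            self.next += size               # bump the pointer
            self.allocations += 1
            return memoryview(self.blocks[-1])[offset:offset + size]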

That approach is fine for the ``submit_(work|timer|wait)`` callbacks,
which basically provide a way to run a presumably-finite-length function
in a parallel thread (invoking callbacks/errbacks as required).

However, it breaks down when dealing with client/server stuff.  Each
invocation of a callback (say, ``data_received(...)``) may only consume,
say, 500 bytes, but it might be called a million times before the
connection is terminated.  You can't have cumulative memory usage with
possibly-infinite-length client/server-callbacks like you can with the
once-off ``submit_(work|wait|timer)`` stuff.

So, enter heap snapshots.  The logic that handles all client/server
connections is instrumented such that it takes a snapshot of the heap
(and all associated stats) prior to invoking a Python method (via
``PyObject_Call()``, for example, i.e. the invocation of
``data_received``).

When the method completes, we can simply roll the heap back to the
snapshot.  The heap's stats, next pointers, et al. get reset to what
they were before the callback was invoked.
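
Building on the allocator sketch above, a snapshot is conceptually just
the allocator's bookkeeping captured before the call and restored
afterwards:

    class HeapSnapshot:
        def __init__(self, heap):
            # Capture the allocator's bookkeeping as it stands right now.
            self.num_blocks = len(heap.blocks)
            self.next = heap.next
            self.allocations = heap.allocations

        def rollback(self, heap):
            del heap.blocks[self.num_blocks:]   # drop blocks added since
            heap.next = self.next
            heap.allocations = self.allocations

    def call_with_rollback(heap, callback, *args):
        # Roughly what the instrumentation around PyObject_Call() does
        # for e.g. data_received(): memory usage stays flat no matter how
        # many times the callback fires.
        snapshot = HeapSnapshot(heap)
        try:
            return callback(*args)
        finally:
            snapshot.rollback(heap)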

That's how the chargen server is able to pump out endless streams of
data for every client whilst keeping memory usage static.  (Well, every
new client currently consumes a minimum of 2MB, but down the track that
can be tweaked back down to SMALL_PAGE_SIZE (4096 bytes) for servers
that need to handle hundreds of thousands of clients simultaneously.)

The only issue with this approach is detecting when the callback has
done the unthinkable (from a shared-nothing perspective) and persisted
some random object it created outside of the parallel context it was
created in.

That's actually a huge separate technical issue to tackle -- and it
applies just as much to the normal ``submit_(wait|work|timer)``
callbacks as well.  I've got a somewhat-temporary solution in place for
that currently:

    d = async.dict()
    def foo():
        # async.rdtsc() is a helper method
        # that basically wraps the result of
        # the assembly RDTSC (read time-
        # stamp counter) instruction into a
        # PyLong object.  So, it's handy when
        # I need to test the very functionality
        # being demonstrated here (creating
        # an object within a parallel context
        # and persisting it elsewhere).
        d['foo'] = async.rdtsc()

    def bar():
        d['bar'] = async.rdtsc()

    async.submit_work(foo)
    async.submit_work(bar)

That'll result in two contexts being created, one for each callback
invocation.  ``async.dict()`` is a "parallel safe" wrapper around a
normal PyDict.  This is referred to as "protection".

In fact, the code above could have been written as follows:

    d = async.protect(dict())

What ``protect()`` does is instrument the object such that we intercept
``__getitem__``, ``__setitem__``, ``__getattr__`` and ``__setattr__``.
We replace these methods with counterparts that serve two purposes (a
rough sketch of such a wrapper follows the list):

 1. The read-only methods are wrapped in a read lock and the write
    methods in a write lock (using the underlying system's slim
    read/write locks, which are uber fast).  Basically, you can have
    unlimited readers holding the read lock, but only one writer can
    hold the write lock (excluding all readers and other writers).

 2. Detecting when parallel objects (objects created from within a
    parallel thread, and thus, backed by the parallel context's heap)
    have been assigned outside the context (in this case, to a
    "protected" dict object that was created from the main thread).

The first point is important as it ensures concurrent access doesn't
corrupt the data structure.

The second point is important because it allows us to prevent the
persisted object's context from automatically transitioning into the
complete->release->heapdestroy lifecycle when the callback completes.

This is known as "persistence", as in, a context has been persisted.
All sorts of things happen to the object when we detect that it's been
persisted.  The biggest thing is that reference counting is enabled
again for the object (from the perspective of the main thread; ref
counting is still a no-op within the parallel thread) -- however, once
the refcount hits zero, instead of free()ing the memory like we'd
normally do in the main thread (or garbage collecting it), we decref
the owning context.

Once the owning context's refcount goes to zero, we know that no more
references exist to objects created from that parallel thread's
execution, and we're free to release the context (and thus, destroy the
heap -> free the memory).
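
In sketch form (illustrative names again, not the actual API), the
lifetime rule boils down to this:

    class OwnedContext:
        # The "owning context" of a persisted object: its heap stays
        # alive until every escaped object backed by it has gone away.
        def __init__(self, heap):
            self.heap = heap
            self.refcnt = 0

        def incref(self):
            self.refcnt += 1

        def decref(self):
            self.refcnt -= 1
            if self.refcnt == 0:
                # No references left to anything from that parallel
                # execution: destroy the heap and free the memory.
                self.heap.blocks.clear()

    def on_persisted_object_refcount_zero(owning_ctx):
        # Instead of free()/GC of the individual object, just drop a
        # reference on the context that owns its memory.
        owning_ctx.decref()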

That's currently implemented and works very well.  There are a few
drawbacks.  First, the user must only assign to an "async protected"
object.  Use a normal dict and you're going to segfault or corrupt
things (or worse) pretty quickly.

Second, we're persisting the entire context potentially for a single
object.  The context may be huge; think of some data processing callback
that ran for ages, racked up a 100MB footprint, but only generated a
PyLong with the value 42 at the end, which consumes, like, 50 bytes (or
whatever the size of a PyLong is these days).

It's crazy keeping a 100MB context around indefinitely until that PyLong
object goes away, so, we need another option.  The idea I have for that
is "promotion".  Rather than persist the context, the object is
"promoted"; basically, the parallel thread palms it off to the main
thread, which proceeds to deep-copy the object, and take over ownership.
This removes the need for the context to be persisted.

Now, I probably shouldn't have said "deep-copy" there.  Promotion is a
terrible option for anything other than simple objects (scalars).  If
you've got a huge list that consumes 98% of your 100MB heap footprint,
well, persistence is perfect.  If it's a 50 byte scalar, promotion is
perfect.  (Also, deep-copy implies collection interrogation, which has
all sorts of complexities, so, err, I'll probably end up supporting
promotion if the object is a scalar that can be shallow-copied.  Any
form of collection or non-scalar type will get persisted by default.)
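
So the decision rule I have in mind looks roughly like this.  (Purely
speculative sketch; none of this exists yet.)

    import copy

    SCALAR_TYPES = (int, float, bool, complex, bytes, str, type(None))

    def escape_parallel_object(obj, owning_ctx):
        if isinstance(obj, SCALAR_TYPES):
            # "Promotion": the main thread takes over ownership with its
            # own shallow copy; the parallel context can then be released
            # as usual.
            return copy.copy(obj)
        # "Persistence": the object stays backed by the parallel heap, so
        # the owning context (and its heap) must stay alive until the
        # object's refcount drops to zero.
        owning_ctx.incref()
        return obj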

I haven't implemented promotion yet (persistence works well enough for
now).  And none of this is integrated into the heap snapshot/rollback
logic -- i.e. we don't detect if a client/server callback assigned an
object created in the parallel context to a main-thread object -- we
just roll back blindly as soon as the callback completes.

Before this ever has a chance of being eligible for adoption into
CPython, those problems will need to be addressed.  As much as I'd like
to ignore the corner cases that violate the shared-nothing approach,
it's inevitable that someone, somewhere, will be assigning parallel
objects outside of the context, maybe for good reason, maybe by
accident, maybe because they don't know any better.  Whatever the
reason, the result shouldn't be corruption.

So, the remaining challenge is preventing the use case alluded to
earlier where someone tries to modify an object that hasn't been "async
protected".  That's a bit harder.  The idea I've got in mind is to
instrument the main CPython ceval loop, such that we do these checks as
part of opcode processing.  That allows us to keep all the logic in the
one spot and not have to go hacking the internals of every single
object's C backend to ensure correctness.
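
The check itself would be conceptually simple, something along these
lines (hypothetical sketch; none of this exists yet):

    def is_protected(obj):
        # Placeholder: would recognize async.protect()-ed objects.
        return getattr(obj, '_async_protected', False)

    def check_store(target, value):
        # Roughly what an instrumented STORE_SUBSCR / STORE_ATTR would
        # do: refuse to let a parallel-heap object escape into a
        # main-thread object that hasn't been protected.
        # (is_parallel_object() is the same placeholder used in the
        # protect() sketch earlier.)
        if is_parallel_object(value) and not is_parallel_object(target):
            if not is_protected(target):
                raise RuntimeError("parallel object assigned to an "
                                   "unprotected main-thread object")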

Now, that'll probably work to an extent.  I mean, after all, there are
opcodes for all the things we'd be interested in instrumenting,
LOAD_GLOBAL, STORE_GLOBAL, STORE_SUBSCR, etc.  What becomes challenging is
detecting arbitrary mutations via object calls, i.e. how do we know,
during the ceval loop, that foo.append(x) needs to be treated specially
if foo is a main-thread object and x is a parallel thread object?

There may be no way to handle that *other* than hacking the internals of
each object, unfortunately.  So, the viability of this whole approach
may rest on whether or not that's deemed an acceptable tradeoff (a
necessary evil, even) by the Python developer community.

If it's not, then it's unlikely this approach will ever see the light of
day in CPython.  If that turns out to be the case, then I see this
project taking the path that Stackless took (forking off and becoming a
separate interpreter).

There's nothing wrong with that; I am really excited about the
possibilities afforded by this approach, and I'm sure it will pique the
interest of commercial entities out there that have problems perfectly
suited to where this pattern excels (shared-nothing, highly concurrent),
much like the relationship that developed between Stackless and Eve
Online.

So, it'd be great if it eventually saw the light of day in CPython, but
that'll be a long way down the track (at least 4.x I'd say), and all
these issues that allow you to instantly segfault or corrupt the
interpreter will need to be addressed before it's even eligible for
*discussion* about inclusion.
</snip>

    Regards,

        Trent.

