[Python-Dev] Slides from today's parallel/async Python talk
Trent Nelson
trent at snakebite.org
Thu Mar 14 22:30:14 CET 2013
Cross-referenced to relevant bits of code where appropriate.
(And just a quick reminder regarding the code quality disclaimer:
I've been hacking away on this stuff relentlessly for a few months;
the aim has been to make continual forward progress without getting
bogged down in non-value-add busy work. Lots of wildly inconsistent
naming conventions and dead code that'll be cleaned up down the
track. And the relevance of any given struct will tend to be
proportional to how many unused members it has (homeless hoarder +
shopping cart analogy).)
On Thu, Mar 14, 2013 at 11:45:20AM -0700, Trent Nelson wrote:
> The basic premise is that parallel 'Context' objects (well, structs)
> are allocated for each parallel thread callback.
The 'Context' struct:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel_private.h#l546
Allocated via new_context():
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4211
...also relevant: new_context_for_socket(), which encapsulates a
client/server instance within a context:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4300
Primary role of the context is to isolate the memory management.
This is achieved via 'Heap':
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel_private.h#l281
(Which I sort of half started refactoring to use the _HEAD_EXTRA
approach when I thought I'd need to have a separate heap type for
some TLS avenue I explored -- turns out that wasn't necessary).
> The context persists for the lifetime of the "parallel work".
>
> The "lifetime of the parallel work" depends on what you're doing. For
> a simple ``async.submit_work(foo)``, the context is considered
> complete once ``foo()`` has been called (presuming no exceptions were
> raised).
Managing context lifetime is one of the main responsibilities of
async.run_once():
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3841
> For an async client/server, the context will persist for the entirety
> of the connection.
Marking a socket context as 'finished' for servers is the job of
PxServerSocket_ClientClosed():
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l6885
> The context is responsible for encapsulating all resources related to
> the parallel thread. So, it has its own heap, and all memory
> allocations are taken from that heap.
The heap is initialized in two steps during new_context(). First,
a handle is allocated for the underlying system heap (via
HeapCreate):
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4224
The first "heap" is then initialized for use with our context via
the Heap_Init(Context *c, size_t n, int page_size) call:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1921
Heaps are actually linked together via a doubly-linked list. The
first heap is a value member (not a pointer) of Context; however,
the active heap is always accessed via the '*h' pointer which is
updated as necessary.
    struct Heap {
        Heap *sle_prev;     /* doubly-linked list of heaps */
        Heap *sle_next;
        void *base;
        void *next;         /* bump pointer for the next allocation */
        int   allocated;
        int   remaining;
        ...

    struct Context {
        Heap  heap;         /* first heap, embedded by value */
        Heap *h;            /* always points at the active heap */
        ...
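The relationship between those two structs can be sketched in Python.
(This is a toy model under my own naming; `next_heap` and the `malloc`
method are illustrative, not the real C fields.)

```python
# Toy model of the Context/Heap layout: the first heap is embedded in
# the context, 'h' tracks the active heap, and allocation bumps 'next'
# within the active heap, chaining a new heap when space runs out.

HEAP_SIZE = 2 * 1024 * 1024  # LARGE_PAGE_SIZE on x64

class Heap:
    def __init__(self, size=HEAP_SIZE):
        self.prev = None       # doubly-linked list of heaps
        self.next_heap = None
        self.base = 0          # stand-in for the base address
        self.next = 0          # bump pointer (offset into this heap)
        self.size = size

    @property
    def remaining(self):
        return self.size - self.next

class Context:
    def __init__(self):
        self.heap = Heap()     # first heap, embedded "by value"
        self.h = self.heap     # always points at the *active* heap

    def malloc(self, n):
        """Bump-allocate n bytes, growing a new heap when exhausted."""
        if n > self.h.remaining:
            new = Heap()
            new.prev = self.h
            self.h.next_heap = new
            self.h = new
        addr = (self.h.base, self.h.next)
        self.h.next += n
        return addr

ctx = Context()
ctx.malloc(100)        # served from the first (embedded) heap
ctx.malloc(HEAP_SIZE)  # doesn't fit in what's left: chains a new heap
```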
> For any given parallel thread, only one context can be executing at a
> time, and this can be accessed via the ``__declspec(thread) Context
> *ctx`` global (which is primed by some glue code as soon as the
> parallel thread starts executing a callback).
Glue entry point for all callbacks is _PyParallel_EnteredCallback:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3047
On the topic of callbacks, the main workhorse for the
submit_(wait|work) callbacks is _PyParallel_WorkCallback:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3120
The interesting logic starts at the 'start:' label:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3251
The interesting part is the error handling. If the callback raises
an exception, we check to see if an errback has been provided. If
so, we call the errback with the error details.
If the callback completes successfully (or it fails, but the errback
completes successfully), that is treated as successful callback or
errback completion, respectively:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3270
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3294
If the errback fails, or no errback was provided, the exception
percolates back to the main thread. This is handled at the 'error:' label:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3300
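That control flow can be sketched in Python. (Names here, such as
run_callback and post_error, are hypothetical stand-ins, not
pyparallel's API.)

```python
# Sketch of the callback/errback error-handling flow described above:
# call the callback; on failure try the errback; if the errback also
# fails (or none was given), post the error back to the main thread.
import sys

def run_callback(callback, args=(), errback=None, post_error=None):
    try:
        callback(*args)
        return 'callback-completed'
    except Exception:
        exc = sys.exc_info()     # (exception, value, traceback)
        if errback is not None:
            try:
                errback(exc)
                return 'errback-completed'
            except Exception:
                exc = sys.exc_info()
        if post_error is not None:
            post_error(exc)      # picked up later by the main thread
        return 'error-posted'
```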
This should make the behavior of async.run_once() clearer. The
first thing it does is check to see if any errors have been posted.
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3917
Errors are returned back to calling code on a first-error-wins
basis. (This involves fiddling with the context's lifetime, as
we're essentially propagating an object created in a parallel
context (the (exception, value, traceback) tuple) back to a main
thread context -- so, we can't blow away that context until the
exception has had a chance to properly bubble back up and be dealt
with.)
If there are no errors, we then check to see if any "call from main
thread" requests have been made:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3936
I added support for this in order to ease unit testing, but it has
general usefulness. It's exposed via two decorators:
    @async.call_from_main_thread
    def foo(arg):
        ...

    def callback():
        foo('abcd')

    async.submit_work(callback)
That creates a parallel thread, invokes callback(), which then
results in foo(arg) eventually being called from the main thread.
This would be useful for synchronising access to a database or
something like that.
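One way to picture what the decorator does (this is a queue-based
emulation of my own, not the actual implementation, which posts the
call through run_once()):

```python
# Hypothetical emulation of @async.call_from_main_thread: calling the
# wrapped function only *schedules* it; the main thread later drains
# the queue and performs the real call.
import queue

main_thread_calls = queue.Queue()

def call_from_main_thread(func):
    def schedule(*args, **kwargs):
        main_thread_calls.put((func, args, kwargs))
    return schedule

def drain_main_thread_calls():
    """What run_once() would do: execute pending main-thread requests."""
    while not main_thread_calls.empty():
        func, args, kwargs = main_thread_calls.get()
        func(*args, **kwargs)

results = []

@call_from_main_thread
def foo(arg):
    results.append(arg)

foo('abcd')                  # schedules only; nothing runs yet
drain_main_thread_calls()    # the "main thread" performs the call
```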
There's also @async.call_from_main_thread_and_wait, which I probably
should have mentioned first:
    @async.call_from_main_thread_and_wait
    def update_login_details(login, details):
        db.update(login, details)

    def foo():
        ...
        update_login_details(x, y)
        # execution will resume when the main thread finishes
        # update_login_details()
        ...

    async.submit_work(foo)
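The *_and_wait variant adds a rendezvous: the parallel thread blocks
until the main thread has run the function. A toy emulation (again my
own names, not the real implementation):

```python
# Hypothetical sketch of call_from_main_thread_and_wait: the calling
# (worker) thread blocks on an event until the "main thread" has
# performed the call and stored the result.
import queue
import threading

pending = queue.Queue()

def call_from_main_thread_and_wait(func):
    def schedule_and_wait(*args, **kwargs):
        done = threading.Event()
        box = {}
        pending.put((func, args, kwargs, box, done))
        done.wait()              # resumes once the main thread ran func
        return box['result']
    return schedule_and_wait

def main_thread_poll():
    """One pass of the main thread's housekeeping loop."""
    while not pending.empty():
        func, args, kwargs, box, done = pending.get()
        box['result'] = func(*args, **kwargs)
        done.set()

@call_from_main_thread_and_wait
def update_login_details(login, details):
    return (login, details)

results = []
worker = threading.Thread(
    target=lambda: results.append(update_login_details('x', 'y')))
worker.start()
while worker.is_alive():         # the "main thread" event loop
    main_thread_poll()
worker.join()
```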
Once all "main thread work requests" have been processed, completed
callbacks and errbacks are processed. This basically just involves
transitioning the associated context onto the "path to freedom" (the
lifecycle that eventually results in the context being free()'d and
the heap being destroyed).
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l4032
> No reference counting or garbage collection is done during parallel
> thread execution. Instead, once the context is finished, it is
> scheduled to be released, which means it'll be "processed" by the main
> thread as part of its housekeeping work (during ``async.run()``
> (technically, ``async.run_once()``).
>
> The main thread simply destroys the entire heap in one fell swoop,
> releasing all memory that was associated with that context.
The "path to freedom" lifecycle is a bit complicated at the moment
and could definitely use a review. But, basically, the main methods
are _PxState_PurgeContexts() and _PxState_FreeContext(); the former
checks that the context is ready to be freed, the latter does the
actual freeing.
_PxState_PurgeContexts:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3789
_PxState_FreeContext:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l3700
The reason for the separation is to maintain the bubbling effect: a
context only makes one transition per run_once() invocation.
Putting this in place was a key step to stop wild crashes in the
early days when unittest would keep hold of exceptions longer than I
was expecting -- it should probably be reviewed in light of the new
persistence support I implemented (much later).
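The two-phase lifecycle can be modeled as a tiny state machine (states
and names below are illustrative, not the real code's):

```python
# Toy model of the "path to freedom": each run_once() pass moves a
# context at most one step, and a context with live references (e.g. a
# persisted exception tuple) is not advanced at all.

class Ctx:
    def __init__(self):
        self.state = 'complete'
        self.refcnt = 0          # e.g. an exception still bubbling up

def purge(ctx):
    """_PxState_PurgeContexts-style pass: check readiness, advance one
    step, and free contexts that were already marked ready."""
    if ctx.state == 'complete' and ctx.refcnt == 0:
        ctx.state = 'ready-to-free'
    elif ctx.state == 'ready-to-free':
        free(ctx)

def free(ctx):
    """_PxState_FreeContext-style step: destroy the heap."""
    ctx.state = 'freed'

ctx = Ctx()
purge(ctx)   # first run_once():  complete -> ready-to-free
purge(ctx)   # second run_once(): ready-to-free -> freed
```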
> There are a few side effects to this. First, the heap allocator
> (basically, the thing that answers ``malloc()`` calls) is incredibly
> simple. It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB
> on x64), and simply returns pointers to that chunk for each memory
> request (adjusting h->next and allocation stats as it goes along,
> obviously). Once the 2MB has been exhausted, another 2MB is
> allocated.
_PyHeap_Malloc is the workhorse here:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l2183
Very simple, just keeps nudging along the h->next pointer for each
request, allocating another heap when necessary. Nice side effect
is that it's ridiculously fast and very cache friendly. Python code
running within parallel contexts runs faster than normal main-thread
code because of this (plus the boost from not doing any ref
counting).
The simplicity of this approach made the heap snapshot logic really
simple to implement too; taking a snapshot and then rolling back is
just a couple of memcpy's and some pointer fiddling.
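A toy Python model of why snapshots are so cheap on a bump allocator
(the class and method names are mine; the real code copies the Heap
struct and stats):

```python
# Snapshot/rollback on a bump allocator: a snapshot is just a copy of
# the allocator's cursor and stats; rollback restores them, instantly
# "freeing" everything allocated since the snapshot.

class BumpHeap:
    def __init__(self, size=2 * 1024 * 1024):
        self.size = size
        self.next = 0            # bump pointer
        self.allocations = 0     # stats

    def malloc(self, n):
        assert n <= self.size - self.next
        addr = self.next
        self.next += n
        self.allocations += 1
        return addr

    def snapshot(self):
        # "a couple of memcpy's and some pointer fiddling"
        return (self.next, self.allocations)

    def rollback(self, snap):
        self.next, self.allocations = snap

h = BumpHeap()
h.malloc(500)
snap = h.snapshot()
h.malloc(500)        # allocations made during a data_received() call
h.rollback(snap)     # rolled back as soon as the callback returns
```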
> That approach is fine for the ``submit_(work|timer|wait)`` callbacks,
> which basically provide a way to run a presumably-finite-length
> function in a parallel thread (and invoking callbacks/errbacks as
> required).
>
> However, it breaks down when dealing with client/server stuff. Each
> invocation of a callback (say, ``data_received(...)``) may only
> consume, say, 500 bytes, but it might be called a million times before
> the connection is terminated. You can't have cumulative memory usage
> with possibly-infinite-length client/server-callbacks like you can
> with the once-off ``submit_(work|wait|timer)`` stuff.
>
> So, enter heap snapshots. The logic that handles all client/server
> connections is instrumented such that it takes a snapshot of the heap
> (and all associated stats) prior to invoking a Python method (via
> ``PyObject_Call()``, for example, i.e. the invocation of
> ``data_received``).
I came up with the heap snapshot stuff in a really perverse way.
The first cut introduced a new 'TLS heap' concept; the idea was that
before you'd call PyObject_CallObject(), you'd enable the TLS heap,
then roll it back when you were done. i.e. the socket IO loop code
had a lot of stuff like this:
    snapshot = ENABLE_TLS_HEAP();
    if (!PyObject_CallObject(...)) {
        DISABLE_TLS_HEAP_AND_ROLLBACK(snapshot);
        ...
    }
    DISABLE_TLS_HEAP();
    ...
    /* do stuff */
    ROLLBACK_TLS_HEAP(snapshot);
That was fine initially, until I had to deal with the (pretty
common) case of allocating memory from the TLS heap (say, for an
async send), and then having the callback picked up by a different
thread. That thread then had to return the other thread's snapshot
and, well, it just fell apart conceptually.
Then it dawned on me to just add the snapshot/rollback stuff to
normal Context objects. In retrospect, it's silly I didn't think of
this in the first place -- the biggest advantage of the Context
abstraction is that it's thread-local, but not bindingly so (as in,
it'll only ever run on one thread at a time, but it doesn't matter
which one, which is essential, because the thread that picks up a
callback isn't necessarily the thread that took the snapshot).
Once I switched out all the TLS heap cruft for Context-specific heap
snapshots, everything "Just Worked". (I haven't removed the TLS
heap stuff yet as I'm still using it elsewhere (where it doesn't
have the issue above). It's an xxx todo.)
The main consumer of this heap snapshot stuff (at the moment) is the
socket IO loop logic:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l5632
Typical usage now looks like this:
    snapshot = PxContext_HeapSnapshot(c, NULL);
    if (!PxSocket_LoadInitialBytes(s)) {
        PxContext_RollbackHeap(c, &snapshot);
        PxSocket_EXCEPTION();
    }
    /* at some later point... */
    PxContext_RollbackHeap(c, &snapshot);
> When the method completes, we can simply roll back the snapshot. The
> heap's stats and next pointers et al all get reset back to what they
> were before the callback was invoked.
>
> The only issue with this approach is detecting when the callback has
> done the unthinkable (from a shared-nothing perspective) and persisted
> some random object it created outside of the parallel context it was
> created in.
>
> That's actually a huge separate technical issue to tackle -- and it
> applies just as much to the normal ``submit_(wait|work|timer)``
> callbacks as well. I've got a somewhat-temporary solution in place
> for that currently:
>
> That'll result in two contexts being created, one for each callback
> invocation. ``async.dict()`` is a "parallel safe" wrapper around a
> normal PyDict. This is referred to as "protection".
>
> In fact, the code above could have been written as follows:
>
> d = async.protect(dict())
>
> What ``protect()`` does is instrument the object such that we
> intercept ``__getitem__``, ``__setitem__``, ``__getattr__`` and
> ``__setattr__``.
The 'protect' details are pretty hairy. _protect does a few checks:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1368
...and then palms things off to _PyObject_PrepOrigType:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1054
That method is where the magic happens. We basically clone the type
object for the object we're protecting, then replace the setitem,
getitem etc methods with our counterparts (described next):
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1100
Note the voodoo involved in 'protecting' heap objects versus normal
C-type objects, GC objects versus non-GC, etc.
> We replace these methods with counterparts that
> serve two purposes:
>
> 1. The read-only methods are wrapped in a read-lock, the write
> methods are wrapped in a write lock (using underlying system slim
> read/write locks, which are uber fast). (Basically, you can have
> unlimited readers holding the read lock, but only one writer can hold
> the write lock (excluding all the readers and other writers).)
>
> 2. Detecting when parallel objects (objects created from within a
> parallel thread, and thus, backed by the parallel context's heap)
> have been assigned outside the context (in this case, to a
> "protected" dict object that was created from the main thread).
This is handled via _Px_objobjargproc_ass:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l900
That is responsible for detecting when a parallel object is being
assigned to a non-parallel object (and tries to persist the object
where necessary).
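The shape of that interception can be sketched in Python. (Entirely a
toy of mine: Python's stdlib has no slim read/write locks, so a plain
Lock stands in for both, and a set of object ids stands in for "is
this object backed by a parallel heap".)

```python
# Hypothetical sketch of "protection": wrap a dict so accesses are
# serialized and assignments of parallel-context objects are detected,
# mimicking the _Px_objobjargproc_ass check described above.
import threading

PARALLEL_OBJECTS = set()     # ids of objects backed by a parallel heap

class ProtectedDict:
    def __init__(self, d):
        self._d = d
        self._lock = threading.Lock()
        self.persisted = []  # contexts we had to keep alive

    def __getitem__(self, key):
        with self._lock:     # read lock in the real code
            return self._d[key]

    def __setitem__(self, key, value):
        with self._lock:     # write lock in the real code
            if id(value) in PARALLEL_OBJECTS:
                # A parallel object is escaping its context: persist
                # its owning context instead of letting it be freed.
                self.persisted.append(id(value))
            self._d[key] = value

d = ProtectedDict({})
normal = object()
parallel = object()
PARALLEL_OBJECTS.add(id(parallel))
d['a'] = normal      # plain main-thread object: nothing special
d['b'] = parallel    # parallel object: triggers persistence
```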
> The first point is important as it ensures concurrent access doesn't
> corrupt the data structure.
>
> The second point is important because it allows us to prevent the
> persisted object's context from automatically transitioning into the
> complete->release->heapdestroy lifecycle when the callback completes.
>
> This is known as "persistence", as in, a context has been persisted.
> All sorts of things happen to the object when we detect that it's been
> persisted. The biggest thing is that reference counting is enabled
> again for the object (from the perspective of the main thread; ref
> counting is still a no-op within the parallel thread) -- however, once
> the refcount hits zero, instead of free()ing the memory like we'd
> normally do in the main thread (or garbage collecting it), we decref
> the reference count of the owning context.
That's the job of _Px_TryPersist (called via _Px_objobjargproc_ass
as mentioned above):
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l861
That makes use of yet-another-incredibly-useful-Windows-feature
called 'init once'; basically, underlying system support for
ensuring something only gets done *once*. Perfect for avoiding race
conditions.
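The 'init once' idea is easy to approximate in Python (a lock plus a
flag; the Windows primitive does this at the OS level):

```python
# Approximation of Windows' one-time initialization: the initializer
# runs exactly once, no matter how many threads race to execute it.
import threading

class InitOnce:
    def __init__(self):
        self._lock = threading.Lock()
        self._done = False
        self.result = None

    def execute(self, initializer):
        with self._lock:
            if not self._done:
                self.result = initializer()
                self._done = True
        return self.result

calls = []
once = InitOnce()

def persist_context():
    calls.append(1)          # e.g. _Px_TryPersist's one-time work
    return 'persisted'

threads = [threading.Thread(target=once.execute, args=(persist_context,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```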
> Once the owning context's refcount goes to zero, we know that no more
> references exist to objects created from that parallel thread's
> execution, and we're free to release the context (and thus, destroy
> the heap -> free the memory).
All that magic is the unfortunate reason my lovely Py_INCREF/DECREF
overrides went from very simple to quite-a-bit-more-involved. i.e.
originally Py_INCREF was just:
    #define Py_INCREF(o) (Py_PXCTX ? (void)0 : Py_REFCNT(o)++)
With the advent of parallel object persistence and context-specific
refcounts, things become less simple:
Py_INCREF:
http://hg.python.org/sandbox/trent/file/7148209d5490/Include/object.h#l890
    __inline
    void
    _Py_IncRef(PyObject *op)
    {
        if ((!Py_PXCTX && (Py_ISPY(op) || Px_PERSISTED(op)))) {
            _Py_INC_REFTOTAL;
            (((PyObject*)(op))->ob_refcnt++);
        }
    }
Py_DECREF:
http://hg.python.org/sandbox/trent/file/7148209d5490/Include/object.h#l911
    __inline
    void
    _Py_DecRef(PyObject *op)
    {
        if (!Py_PXCTX) {
            if (Px_PERSISTED(op))
                Px_DECREF(op);
            else if (!Px_ISPX(op)) {
                _Py_DEC_REFTOTAL;
                if ((--((PyObject *)(op))->ob_refcnt) != 0) {
                    _Py_CHECK_REFCNT(op);
                } else
                    _Py_Dealloc((PyObject *)(op));
            }
        }
    }
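The persisted-object rule those overrides implement can be modeled in
a few lines of Python (class names are illustrative): when a persisted
object's refcount hits zero, we decref its *owning context* rather
than freeing memory, and the context's heap dies only when the last
persisted object does.

```python
# Toy model: persisted objects hold a reference on their owning
# context; the heap is destroyed in one fell swoop only once no
# persisted objects remain.

class ParallelContext:
    def __init__(self):
        self.refcnt = 0
        self.heap_destroyed = False

    def decref(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.heap_destroyed = True   # free the whole heap at once

class PersistedObject:
    def __init__(self, ctx):
        self.ctx = ctx
        self.ob_refcnt = 1
        ctx.refcnt += 1                  # persisting takes a context ref

    def decref(self):                    # Px_DECREF in the real code
        self.ob_refcnt -= 1
        if self.ob_refcnt == 0:
            self.ctx.decref()            # no free(): drop the context ref

owner = ParallelContext()
a = PersistedObject(owner)
b = PersistedObject(owner)
a.decref()   # heap survives: b still pins the context
b.decref()   # last persisted object gone -> heap destroyed
```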
> That's currently implemented and works very well. There are a few
> drawbacks: one, the user must only assign to an "async protected"
> object. Use a normal dict and you're going to segfault or corrupt
> things (or worse) pretty quickly.
>
> Second, we're persisting the entire context potentially for a single
> object. The context may be huge; think of some data processing
> callback that ran for ages, racked up a 100MB footprint, but only
> generated a PyLong with the value 42 at the end, which consumes, like,
> 50 bytes (or whatever the size of a PyLong is these days).
>
> It's crazy keeping a 100MB context around indefinitely until that
> PyLong object goes away, so, we need another option. The idea I have
> for that is "promotion". Rather than persist the context, the object
> is "promoted"; basically, the parallel thread palms it off to the main
> thread, which proceeds to deep-copy the object, and take over
> ownership. This removes the need for the context to be persisted.
>
> Now, I probably shouldn't have said "deep-copy" there. Promotion is a
> terrible option for anything other than simple objects (scalars). If
> you've got a huge list that consumes 98% of your 100MB heap footprint,
> well, persistence is perfect. If it's a 50 byte scalar, promotion is
> perfect. (Also, deep-copy implies collection interrogation, which has
> all sorts of complexities, so, err, I'll probably end up supporting
> promotion if the object is a scalar that can be shallow-copied. Any
> form of collection or non-scalar type will get persisted by default.)
>
> I haven't implemented promotion yet (persistence works well enough for
> now). And none of this is integrated into the heap snapshot/rollback
> logic -- i.e. we don't detect if a client/server callback assigned an
> object created in the parallel context to a main-thread object -- we
> just roll back blindly as soon as the callback completes.
>
> Before this ever has a chance of being eligible for adoption into
> CPython, those problems will need to be addressed. As much as I'd
> like to ignore those corner cases that violate the shared-nothing
> approach -- it's inevitable someone, somewhere, will be assigning
> parallel objects outside of the context, maybe for good reason, maybe
> by accident, maybe because they don't know any better. Whatever the
> reason, the result shouldn't be corruption.
>
> So, the remaining challenge is preventing the use case alluded to
> earlier where someone tries to modify an object that hasn't been
> "async protected". That's a bit harder. The idea I've got in mind is
> to instrument the main CPython ceval loop, such that we do these
> checks as part of opcode processing. That allows us to keep all the
> logic in the one spot and not have to go hacking the internals of
> every single object's C backend to ensure correctness.
>
> Now, that'll probably work to an extent. I mean, after all, there are
> opcodes for all the things we'd be interested in instrumenting,
> LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc. What becomes challenging is
> detecting arbitrary mutations via object calls, i.e. how do we know,
> during the ceval loop, that foo.append(x) needs to be treated
> specially if foo is a main-thread object and x is a parallel thread
> object?
>
> There may be no way to handle that *other* than hacking the internals
> of each object, unfortunately. So, the viability of this whole
> approach may rest on whether or not that's deemed an acceptable
> tradeoff (a necessary evil, even) by the Python developer community.
Actually, I'd sort of forgotten that I started adding protection
support for lists in _PyObject_PrepOrigType. Well, technically,
support for intercepting PySequenceMethods:
http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l1126
I settled for just intercepting PyMappingMethods initially, which
is why that chunk of code is commented out. Intercepting the
mapping methods allowed me to implement the async protection for
dicts and generic objects, which was sufficient for testing
purposes at the time.
So, er, I guess my point is that automatically detecting object
mutation might not be as hard as I'm alluding to above. I'll be
happy if we're able to simply raise an exception if you attempt to
mutate a non-protected main-thread object. That's infinitely better
than segfaulting or silent corruption.
Trent.