[Python-ideas] [Python-Dev] PyParallel: alternate async I/O and GIL removal
Trent Nelson
trent at snakebite.org
Sun Nov 17 23:34:32 CET 2013
(I saw that there were a number of additional e-mails echoing
Guido's sentiment/concerns re: shared nothing. I picked this
thread to reply to and tried to provide as much info as possible
in lieu of replying to everyone individually.)
On Sat, Nov 16, 2013 at 06:56:11PM -0800, Guido van Rossum wrote:
> I wish you had spent more time on explaining how IOCP works and less on
> judging other approaches.
Heh, it's funny, with previous presentations, I didn't labor the
point anywhere near as much, and I found that when presenting to
UNIX people, they were very defensive of the status quo.
I probably over-compensated in the opposite direction this time.
I don't think anyone is going to argue vehemently that the UNIX
status quo is optimal on Windows, but a side-effect is that it
unnecessarily slanders existing bodies of work (Twisted et al.)
that have undeniably improved the overall ecosystem over the past
decade.
> Summarizing my understanding of what you're saying, it seems the "right"
> way to use IOCP on a multi-core machine is to have one thread per core
> (barring threads you need for unavoidably blocking stuff) and to let the
> kernel schedule callbacks on all those threads. As long as the callbacks
> don't block and events come in at a rate to keep all those cores busy this
> will be optimal.
The only thing I'd add is that, when speaking in terms of socket
servers and whatnot, it helps to visualize Python callbacks as "the
bits of logic that need to run before invoking the next asynchronous
call".
Anything I/O related can be done via an asynchronous call; that's
basically the exit point of the processing thread -- it dispatches
the async WSARecv() (for example), then moves on to the next request
in the I/O completion port's queue.
When that WSARecv() returns, we get all the info we need from the
completion context to figure out what we just did, and, based on the
protocol we provided, what needs to be done next.
So, we do a little more pure Python processing and then dispatch the
next asynchronous call, which, in this case, might be a WSASend();
the thread then moves on to the next request in the queue.
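To make that concrete, here's a minimal, purely illustrative sketch
of the shape a protocol takes from Python's point of view -- the
class and method names are assumptions for illustration, not
necessarily the exact PyParallel API:
class EchoProtocol:
    def connection_made(self, transport):
        # Pure-Python logic; runs on whichever thread the kernel handed
        # the completion packet to off the I/O completion port.
        # Returning bytes asks the loop to dispatch the next async send.
        return b'Welcome!\r\n'

    def data_received(self, transport, data):
        # The WSARecv() completed; do our bit of Python work, then hand
        # the result back so the loop can issue the next async call
        # (e.g. a WSASend()) and move on to the next request.
        return data.upper()
Everything outside those two methods -- posting the overlapped
operations and collecting the completions -- is the loop's job.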
That's all handled by the PxSocket_IOLoop monstrosity:
http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6246
I got the inspiration for that implementation from CEval_FrameEx;
you basically have one big inline method where you can go from
anything to anything without having to call additional C functions;
thus, doing back-to-back sends, for example, won't exhaust your
stack.
That allows us to do the dynamic switch between sync and async
depending on protocol preference, current client load, number of
active I/O hogs, that sort of thing:
http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6467
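Roughly speaking, the decision looks something like the following --
the real heuristics live in the C linked above; these names and
thresholds are invented:
def choose_send_strategy(protocol, active_io_hogs, active_clients, ncpus):
    # Invented illustration of the sync-vs-async send decision.
    if (protocol.prefers_sync_send
            and active_io_hogs == 0
            and active_clients <= ncpus):
        # Light load: a synchronous send will almost always complete
        # immediately and avoids the overhead of posting an overlapped
        # operation and taking another trip through the completion port.
        return 'sync'
    # Otherwise: post the overlapped WSASend() and let this thread move
    # on to the next completion packet rather than risk blocking a core.
    return 'async'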
PxSocket_IOLoop currently only handles 1:1 TCP/IP connections, which
limits its applicability. I want to expand that -- I should be able
to connect any sort of end points together in any fashion -- similar
to how ZeroMQ allows the bridge/fan-out/router type composition.
An endpoint would be anything that allows me to initiate an async
operation against it, e.g. file, device, socket, whatever. This
is where Windows really shines, because you can literally do
everything either synchronously or asynchronously. There should
also be support for 1:m and m:n relationships between endpoints
(e.g. an IRC chat server).
So I see PxSocket_IOLoop turning into a more generic PxThread_Loop
that can handle anything-to-anything -- basically, calling the
Python code that needs to run before dispatching the next async
call.
The current implementation also does a lot of live introspection
against the protocol object to figure out what to do next; e.g. the
first entry point for a newly-connected client is here:
http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6326
At every entry point into the loop, and at every point *after* the
relevant Python code has been run, we're relying on the protocol to
tell us what to do next in a very hard-coded fashion.
I think for PxThread_Loop to become truly dynamic, it should mirror
CEval_FrameEx even closer; the protocol analysis should be done
separately, the output of which is a stream of async-opcode bytes
that direct the main dispatching logic:
http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6305
dispatch:
    switch (next_opcode) {
        TARGET(maybe_shutdown_send_or_recv);
        TARGET(handle_error);
        TARGET(connection_made_callback);
        TARGET(data_received_callback);
        TARGET(send_complete_callback);
        TARGET(overlapped_recv_callback);
        TARGET(post_callback_that_supports_sending_retval);
        TARGET(post_callback_that_does_not_support_sending_retval);
        TARGET(close_);
        TARGET(try_send);
        default:
            break;
    }
Then we'd have one big case statement just like with CEval_FrameEx
that handles all possible async-opcodes, rather than the goto
spaghetti in the current PxSocket_IOLoop.
The async opcodes would be generic and platform-independent, e.g.
file write, file read, single socket write, multi-socket write, etc.
On Windows/Solaris/AIX, everything could be handled asynchronously;
on other platforms, you would have to fake it using an event loop +
multiplexing method, identical to how twisted/tornado/tulip do it
currently.
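As a purely illustrative sketch of that split -- the opcode constants
and driver below are invented, not actual PyParallel code:
FILE_READ, FILE_WRITE, SOCKET_SEND, SOCKET_RECV, CLOSE = range(5)

def run_opcodes(opcodes, handlers):
    # 'opcodes' is the platform-independent stream produced by the
    # protocol analysis; 'handlers' maps opcode -> callable.  On Windows
    # each handler would post a native overlapped operation; elsewhere
    # it would fall back to an event loop + multiplexing backend
    # (select/epoll/kqueue), as twisted/tornado/tulip do today.
    for opcode, arg in opcodes:
        handlers[opcode](arg)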
> But this is almost tautological. It only works if the threads don't
> communicate with each other or with the main thread (all shared data must
> be read-only). But heh, if that's all, one process per core works just as
> well. :-)
Ok, so, heh, I lied in the presentation. The main thread won't be
frozen per se, and the parallel threads will have a way to share
state. I've already done a huge amount of work on this, but it's
very involved and that presentation was long enough as it is.
Also, it's easier to understand why reference counting and GC isn't
needed in parallel contexts if you just assume the main thread isn't
running.
In reality, one of the first things I had to figure out was how
these parallel contexts could communicate state back to the main
thread -- because without this ability, how the heck would you
propagate an exception raised in a parallel thread back to the main
thread? The exception will be backed by memory allocated in the
parallel context -- that can't be freed until the exception has
been dealt with and no references to it remain.
As that was one of the first problems I had to solve, it has one of
the hackiest implementations :-) The main thread's async.run_once()
implementation can detect which parallel threads raised exceptions
(because they've done an interlocked push to the main thread's error
list) and it will extend the lifetime of the context for an
additional number of subsequent runs of run_once(). Once the TTL of
the context drops to 0, it is finally released.
The reason it's hacky is because there's no direct correlation
between the exception object finally having no references to it and
the point we destroy the context. If you persisted the exception
object to a list in the main thread somewhere, you'd segfault down
the track when trying to access that memory.
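In made-up Python rather than the actual C, the hack looks roughly
like this (the field and helper names are invented):
def run_once(error_list, live_contexts, ttl=4):
    # Parallel threads have interlock-pushed (context, exception) pairs
    # onto error_list; adopt them and pin the owning context so the
    # memory backing the exception stays valid for now.
    while error_list:
        ctx, exc = error_list.pop()
        ctx.ttl = ttl
        live_contexts.add(ctx)
        handle_exception(exc)    # hypothetical main-thread handling
    # Age out contexts; once the TTL hits zero, the context (and thus
    # the exception's backing memory) is finally released -- regardless
    # of whether references to the exception actually remain.
    for ctx in list(live_contexts):
        ctx.ttl -= 1
        if ctx.ttl <= 0:
            live_contexts.discard(ctx)
            ctx.release()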
So, on the second iteration, I came up with some new concepts;
context persistence and object promotion. A main-thread list or
dict could be async protected such that this would work:
# main thread
d1 = {}
l1 = []
async.protect(d1)
async.protect(l1)

# this would also work
d2 = async.dict()
l2 = async.list()

# fyi: async.rdtsc() returns a PyLong wrapped
# version of the CPU TSC; handy for generating
# non-interned objects allocated from a parallel
# context
def callback(name):
    d1[name] = async.rdtsc()
    l1.append(async.rdtsc())

async.submit_work(callback, 'foo')
async.submit_work(callback, 'bar')
async.submit_work(callback, 'moo')
async.run()
That actually works; the async.protect() call intercepts the
object's tp_as_mapping and tp_as_sequence fields and redirects them
to thread-safe versions that use read/write locks. It also toggles
a persistence bit on both the parallel context and the parallel long
object, such that reference counting *is* actually enabled on it
once it's back in the main thread. When the ref count drops to 0,
we check whether it's an object that's been persisted, and if so,
we decref the original context; only when the context's refcount
gets to 0 do we free it.
(I also did some stuff where you could promote simple objects
where it made sense -- i.e. there's no need to keep a 4k context
around if the end result was a scalar that could be represented
in 50-200 bytes; just memcpy it from the main thread ("promote
it to a main thread object with reference counting") and free
the context.)
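As a conceptual sketch only (the threshold and helper names below are
invented; the real logic lives in the C memory machinery), the
promote-versus-persist decision is something like:
PROMOTE_THRESHOLD = 256    # bytes; invented cut-off for "simple" objects

def return_to_main_thread(ctx, obj):
    if obj.size <= PROMOTE_THRESHOLD:
        # Promote: memcpy the object into a main-thread allocation with
        # normal reference counting, then free the (much larger) context.
        main_obj = copy_to_main_heap(obj)    # hypothetical helper
        ctx.release()
        return main_obj
    # Persist: keep the whole parallel context alive; the object holds a
    # reference to it, and the context is only freed once the object's
    # refcount drops to zero back in the main thread.
    obj.persisted_context = ctx
    ctx.incref()
    return obj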
You can see some examples of the different type of stuff you can do
here:
http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/test/test_primitives.py
The problem, though, was that none of my unit tests assigned more
than ten items to a list/dict, so I never encountered a resize :-)
You can imagine what happens when a resize takes place within a
parallel context: the list/dict is realloc'd using the parallel
context's heap allocator. That's not ideal -- it's a main thread
object, and it shouldn't be reallocated with temporary parallel
thread memory.
I think that was the point where I went "oh, bollocks!" and switched
over to tackling the async socket stuff.
However, the async socket work forced me to implement all sorts of
new concepts, including the heap snapshots and TLS heap overrides
(for interned strings).
Pair that with the page locking stuff and I have a much richer set
of tools at my disposal to solve that problem -- I just need to
completely overhaul everything memory related now that I know how it
needs to be implemented :-)
As for the dict/list assignment/resize, the key problem is figuring
out whether a PyObject_Realloc call is taking place because we're
resizing a main thread container object -- that's not an easy thing
to figure out -- all you have is a pointer at the time you need to
make the decision.
That's where the memory refactoring work comes in -- I'm still
working on the details, but the general idea is that you'll be able
to do very efficient pointer address tests against known base
address masks to figure out the origin of the object and how the
current memory request needs to be satisfied.
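The idea, expressed as an invented Python sketch (the constants and
the helper are made up; the real test would be a cheap C comparison
against reserved base addresses), is roughly:
PARALLEL_HEAP_BASE = 0x1f0000000000    # invented reserved base address
PARALLEL_HEAP_MASK = 0xffff00000000    # invented mask covering that range

def is_parallel_heap_pointer(address):
    # One mask-and-compare tells us whether the pointer being realloc'd
    # lives in a parallel context's heap (temporary memory) or in a
    # main-thread heap -- which decides how the request is satisfied.
    return (address & PARALLEL_HEAP_MASK) == PARALLEL_HEAP_BASE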
The other option I played around with was an interlocked list type
that is exposed directly to Python:
x = async.xlist()

def callback():
    x.push(async.rdtsc())

for _ in xrange(0, 10):
    async.submit_work(callback)
async.run()

# interlocked flush of all results into a list.
results = x.flush()
The key difference between an interlocked list and a normal list is
that an interlocked list has its very own localized heap, just like
parallel contexts have; pushing a scalar onto the list automatically
"promotes it". That is, the object is memcpy'd directly using the
xlist's heap, and we can keep that heap alive independently of the
parallel contexts that pushed objects onto it.
I was also planning on using this as a waitable queue, so you could
compose pipelines of producers/consumers and that sort of thing.
Then I ran out of time :-)
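For what it's worth, a speculative sketch of what that composition
might have looked like -- the wait() call shown here never existed,
and everything in this snippet is hypothetical:
queue = async.xlist()

def producer(n):
    # Push results onto the interlocked list; scalars are "promoted"
    # onto the xlist's own heap, so they outlive the parallel context.
    queue.push(n * n)

def consumer():
    queue.wait()                  # hypothetical: block until signalled
    for result in queue.flush():  # interlocked drain of everything so far
        handle(result)            # hypothetical downstream stage

for i in xrange(0, 10):
    async.submit_work(producer, i)
async.submit_work(consumer)
async.run()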
> I don't really care how well CHARGEN (I had to look it up) scales. For
> HTTP, it's great for serving static contents from a cache or from the
> filesystem, but if that's all you serve, why use Python? Real web apps use
> intricate combinations of databases, memcache, in-memory cache, and
> template expansion. The biggest difference you can make there is probably
> getting rid of the ORM in favor of more direct SQL, and next on the list
> would be reimplementing template expansion in C. (And heck, you could
> release the GIL while you're doing that. :-)
Agree with the general sentiment "if that's all you're doing, why
use Python?". The async HTTP server should allow other things to be
built on top of it such that it's adding value over and above, say,
an apache instance serving static files.
And in hindsight, perhaps I need to put more emphasis on the fact
that it *is* very experimental work with a long-term view, versus
Tulip/asyncio, which was intended for *now*. So although Tulip and
PyParallel spawned from the same discussions and are attempting to
attack the same problem -- it's really not fair for me to discredit
Tulip/Twisted in favor of PyParallel because they're on completely
different playing fields with vastly different implementation time
frames (I'm thinking 5+ years before this work lands in a mainstream
Python release -- if it ever does. And if not, hey, it can live on
as another interpreter, just like Stackless et al).
> I would love it if you could write a list of things a callback *cannot* do
> when it is in parallel mode. I believe that list includes mutating any
> kind of global/shared state (any object created in the main thread is
> read-only in parallel mode -- it seems you had to work hard to make string
> interning work, which is semantically transparent but mutates hidden
> global state). In addition (or, more likely, as a consequence!) a callback
> cannot create anything that lasts beyond the callback's lifetime, except
> for the brief time between the callback's return and the completion of the
> I/O operation involving the return value. (Actually, I missed how you do
> this -- doesn't this mean you cannot release the callback's heap until
> much later?)
So, I think I already answered that above. The next presentation
(PyCon Montreal) will be purely focused on this stuff -- I've been
beating the alternate-approach-to-async-I/O drum for long enough ;-)
> So it seems that the price for extreme concurrency is the same as always
> -- you can only run purely functional code. Haskell fans won't mind, but
> for Python this seems to be putting the cart before the horse -- who wants
> to write Python with those constraints?
Basically it's all still work in progress, but the PyParallel-for-
parallel-compute use case is very important. And there's no way
that can be done without having a way to return the results of
parallel computation back into the next stage of your pipeline where
more analysis is done.
Getting hired by Continuum is actually great for this use case;
we're in the big data, parallel task decomposition space, after all,
not the writing-async-socket-server business ;-) I know Peter and
Travis are both very supportive of PyParallel, so it's just a matter
of trying to find time to work on it between consultancy engagements.
Trent.