Mailman 3 Re: [Python-ideas] [Python-Dev] PyParallel: alternate async I/O and GIL removal - Python-ideas

Trent Nelson

November 2013

2:24 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Sat, Nov 16, 2013 at 05:39:13PM -0800, Guido van Rossum wrote:

...

Trent, I watched your video and read your slides. (Does the word "motormouth" mean anything to you? :-)

Side-effect of both a) not having time to rehearse, and b) trying to compress 153 slides into 45 minutes :-)

...

Finally, I'm not sure why you are so confrontational about the way Twisted and Tulip do things. We are doing things the only way they *can* be done without overhauling the entire CPython implementation (which you have proven will take several major release cycles, probably until 4.0). It's fine that you are looking further forward than most of us. I don't think it makes sense that you are blaming the rest of us for writing libraries that can be used today.

I watched the video today; there's a point where I say something along the lines of "that's not how you should do IOCP; they're doing it wrong". That definitely came out wrong -- when limited to a single-threaded execution model, which today's Python is, then calling GetQueuedCompletionStatus() in a single-threaded event loop is really the only option you have. (I think I also say "that's just as bad as select()"; I didn't mean that either -- it's definitely better than select() when you're limited to the single-threaded execution model. What I was trying to convey was that doing it like that wasn't really how IOCP was designed to be used -- which is why I dig into the intrinsic link between IOCP, async I/O and threading for so many slides.) And in hindsight, perhaps I need to put more emphasis on the fact that it *is* very experimental work with a long-term view, versus Tulip/asyncio, which was intended for *now*. So although Tulip and PyParallel spawned from the same discussions and are attempting to attack the same problem -- it's really not fair for me to discredit Tulip/Twisted in favor of PyParallel because they're on completely different playing fields with vastly different implementation time frames (I'm thinking 5+ years before this work lands in a mainstream Python release -- if it ever does. And if not, hey, it can live on as another interpreter, just like Stackless et al).

...

Clearly your work isn't ready for python-dev -- it is just too speculative. I've moved python-dev to BCC and added python-ideas.

It possibly doesn't even belong on python-ideas -- if you are serious about wanting to change Linux or other *NIX variants, you'll have to go find a venue where people who do forward-looking kernel work hang out.

Yeah this e-mail was more of a final follow up to e-mails I sent to python-ideas last year re: the whole "alternate async approach" thread. (I would have replied to that thread directly, had I kept it in my inbox.) Trent.

Reply

Sign in to reply online Use email software

Andrew Barnert

6:31 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

From: Guido van Rossum <guido@python.org> Sent: Saturday, November 16, 2013 6:56 PM

...

Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy this will be optimal.

But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)

I got the same impression from the presentation. First, I completely agree with the fact that most Unix servers are silly on Windows even in the single-threaded case—simulating epoll on top of single-threaded completion-based GQCS just so you can simulate a completion-based design on top of your simulated ready-based epoll is wasteful and overly complex. But that's a much more minor issue than taking advantage of Windows' integration between threading and async I/O, and one that many server frameworks have already fixed, and that PyParallel isn't necessary for. I also agree that using IOCP for a multi-threaded proactor instead of a single-threaded reactor plus dispatcher is a huge win in the kinds of shared-memory threaded apps that you can't write in CPython. From my experience building a streaming video server and an IRC-esque interactive communications server, using a reactor plus dispatcher on Windows means one core completely wasted, 40% less performance from the others, and much lower scalability; emulating a proactor on Unix on top of a reactor and dispatcher is around a 10% performance cost (plus a bit of extra code complexity). So, a threaded proactor wins, unless you really don't care about Windows. But PyParallel doesn't look like it supports such applications any better than stock CPython. As soon as you need to send data from one client to other clients, you're not in a shared-nothing parallel context anymore. Even less extreme cases than streaming video or chat, where all you need is, e.g., shared caching of dynamically-generated data, I don't see how you'd do that in PyParallel. If you can build a simple multi-user chat server with PyParallel, and show it using all my cores, that would be a lot more compelling.

Reply

Sign in to reply online Use email software

Andrew Barnert

7:41 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Nov 16, 2013, at 22:35, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

I wonder whether an explicit copy_to_main_thread (maybe both shallow and deep variants) function, maybe with a Queue subclass that called it automatically in the put method, would be sufficient for a decent class of applications to be built?

...

Windows is fine at scheduling ncores separate processes; it's just slow at _starting_ each one. And most servers aren't constantly creating and reaping processes; they create ncores processes at startup or when they're first needed. So if it takes 0.9 seconds instead of 0.2 to restart your server, is that a big enough problem to rewrite the whole server (and the interpreter)? As I said in my previous message, the benefit of being able to skip refcounting might make it worth doing. But avoiding process creation overhead isn't much if a win.

Reply

Sign in to reply online Use email software

Antoine Pitrou

10:30 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Sun, 17 Nov 2013 16:35:23 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:

...

This is a bit of an oversimplification. The cost of processes is not only the cost of spawning them. There is also the CPU cost of marshalling data between processes, and the memory cost of having duplicate structures and data in your various processes. (also, note that using a process pool generally amortizes the spawning cost quite well) Regards Antoine.

Reply

Sign in to reply online Use email software

Trent Nelson

10:34 p.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

(I saw that there were a number of additional e-mails echo'ing Guido's sentiment/concerns re: shared nothing. I picked this thread to reply to and tried to provide as much info as possible in lieu of replying to everyone individually.) On Sat, Nov 16, 2013 at 06:56:11PM -0800, Guido van Rossum wrote:

...

I wish you had spent more time on explaining how IOCP works and less on judging other approaches.

Heh, it's funny, with previous presentations, I didn't labor the point anywhere near as much, and I found that when presenting to UNIX people, they were very defensive of the status quo. I probably over-compensated a little too much in the opposite direction this time; I don't think anyone is going to argue vehemently that the UNIX status quo is optimal on Windows; but a side-effect is that it unnecessarily slanders existing bodies of work (Twisted et al) that have undeniably improved the overall ecosystem over the past decade.

...

Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy this will be optimal.

The only thing I'd add is that, when speaking in terms of socket servers and whatnot, it helps to visualize Python callbacks as "the bits of logic that need to run before invoking the next asynchronous call". Anything I/O related can be done via an asynchronous call; that's basically the exit point of the processing thread -- it dispatches the async WSARecv() (for example), then moves onto the next request in the I/O completion port's queue. When that WSARecv() returns, we get all the info we need from the completion context to figure out what we just did, and, based on the protocol we provided, what needs to be done next. So, we do a little more pure Python processing and then dispatch the next asynchronous call, which, in this case, might be a WSASend(); the thread then moves onto the next request in the queue. That's all handled by the PxSocket_IOLoop monstrosity: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l62... I got the inspiration for that implementation from CEval_FrameEx; you basically have one big inline method where you can go from anything to anything without having to call additional C functions; thus, doing back-to-back sends, for example, won't exhaust your stack. That allows us to do the dynamic switch between sync and async depending on protocol preference, current client load, number of active I/O hogs, that sort of thing: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l64... PxSocket_IOLoop currently only handles 1:1 TCP/IP connections, which limits its applicability. I want to expand that -- I should be able to connect any sort of end points together in any fashion -- similar to how ZeroMQ allows the bridge/fan-out/router type composition. An endpoint would be anything that allows me to initiate an async operation against it, e.g. file, device, socket, whatever. This is where Windows really shines, because you can literally do everything either synchronously or asynchronously. There should also be support for 1:m and m:n relationships between endpoints (i.e. an IRC chat server). So I see PxSocket_IOLoop turning into a more generic PxThread_Loop that can handle anything-to-anything -- basically, calling the Python code that needs to run before dispatching the next async call. The current implementation also does a lot of live introspection against the protocol object to figure out what to do next; i.e. first entry point for a newly-connected client is here: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l63... At every entry point into the loop, and at every point *after* the relevant Python code has been run, we're relying on the protocol to tell us what to do next in a very hard-coded fashion. I think for PxThread_Loop to become truly dynamic, it should mirror CEval_FrameEx even closer; the protocol analysis should be done separately, the output of which is a stream of async-opcode bytes that direct the main dispatching logic: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l63... dispatch: switch (next_opcode) { TARGET(maybe_shutdown_send_or_recv); TARGET(handle_error); TARGET(connection_made_callback); TARGET(data_received_callback); TARGET(send_complete_callback); TARGET(overlapped_recv_callback); TARGET(post_callback_that_supports_sending_retval); TARGET(post_callback_that_does_not_support_sending_retval); TARGET(close_); TARGET(try_send); default: break; } Then we'd have one big case statement just like with CEval_FrameEx that handles all possible async-opcodes, rather than the goto spaghetti in the current PxSocket_IOLoop. The async opcodes would be generic and platform-independent; i.e. file write, file read, single socket write, multi-socket write, etc. On Windows/Solaris/AIX, everything could be handled asynchronously, on other platforms, you would have to fake it using an event loop + multiplex method, identical to how twisted/tornado/tulip do it currently.

...

But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)

Ok, so, heh, I lied in the presentation. The main thread won't be frozen per-se, and the parallel threads will have a way to share state. I've already done a huge amount of work on this, but it's very involved and that presentation was long enough as it is. Also, it's easier to understand why reference counting and GC isn't needed in parallel contexts if you just assume the main thread isn't running. In reality, one of the first things I had to figure out was how these parallel contexts could communicate state back to the main thread -- because without this ability, how the heck would you propagate an exception raised in a parallel thread back to the main thread? The exception will be backed my memory allocated in the parallel context -- that can't be free'd until the exception has been dealt with and no references to it remain. As that was one of the first problems I had to solve, it has one of the hackiest implementations :-) The main thread's async.run_once() implementation can detect which parallel threads raised exceptions (because they've done an interlocked push to the main thread's error list) and it will extend the lifetime of the context for an additional number of subsequent runs of run_once(). Once the TTL of the context drops to 0, it is finally released. The reason it's hacky is because there's no direct correlation between the exception object finally having no references to it and the point we destroy the context. If you persisted the exception object to a list in the main thread somewhere, you'd segfault down the track when trying to access that memory. So, on the second iteration, I came up with some new concepts; context persistence and object promotion. A main-thread list or dict could be async protected such that this would work: # main thread d1 = {} l1 = [] async.protect(d1) async.protect(l1) # this would also work d2 = async.dict() l2 = async.list() # fyi: async.rdtsc() returns a PyLong wrapped # version of the CPU TSC; handy for generating # non-interned objects allocated from a parallel # context def callback(name): d1[name] = async.rdtsc() l1.append(async.rdtsc()) async.submit_work(callback, 'foo') async.submit_work(callback, 'bar') async.submit_work(callback, 'moo') async.run() That actually works; the async.protect() call intercepts the object's tp_as_mapping and tp_as_sequence fields and redirects them to thread-safe versions that use read/write locks. It also toggles a persistence bit on both the parallel context and the parallel long object, such that reference counting *is* actually enabled on it once it's back in the main thread -- when the ref count drops to 0, we check to see if it's an object that's been persisted, and if so, we decref the original context -- when the context's refcount gets to 0, only *then* do we free it. (I also did some stuff where you could promote simple objects where it made sense -- i.e. there's no need to keep a 4k context around if the end result was a scalar that could be represented in 50-200 bytes; just memcpy it from the main thread ("promote it to a main thread object with reference counting") and free the context.) You can see some examples of the different type of stuff you can do here: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/test/test_pri... The problem though was that none of my unit tests assigned more than ten items to a list/dict, so I never encountered a resize :-) You can imagine what happens when a resize takes place within a parallel context -- the list/dict is realloc'd using the parallel context heap allocator -- that's not ideal, it's a main thread object, it shouldn't be reallocated with temporary parallel thread memory. I think that was the point where I went "oh, bollocks!" and switched over to tackling the async socket stuff. However, the async socket work forced me to implement all sorts of new concepts, including the heap snapshots and TLS heap overrides (for interned strings). Pair that with the page locking stuff and I have a much richer set of tools at my disposal to solve that problem -- I just need to completely overhaul everything memory related now that I know how it needs to be implemented :-) As for the dict/list assignment/resize, the key problem is figuring out whether a PyObject_Realloc call is taking place because we're resizing a main thread container object -- that's not an easy thing to figure out -- all you have is a pointer at the time you need to make the decision. That's where the memory refactoring work comes in -- I'm still working on the details, but the general idea is that you'll be able to do very efficient pointer address tests against known base address masks to figure out the origin of the object and how the current memory request needs to be satisfied. The other option I played around with was an interlocked list type that is exposed directly to Python: x = async.xlist() def callback(): x.push(async.rdtsc()) for _ in xrange(0, 10): async.submit_work(callback) async.run() # interlocked flush of all results into a list. results = x.flush() The key difference between an interlocked list and a normal list is that an interlocked list has its very own localized heap, just like parallel contexts have; pushing a scalar onto the list automatically "promotes it". That is, the object is memcpy'd directly using the xlist's heap, and we can keep that heap alive independently to the parallel contexts that pushed objects onto it. I was also planning on using this as a waitable queue, so you could compose pipelines of producers/consumers and that sort of thing. Then I ran out of time :-)

...

I don't really care how well CHARGEN (I had to look it up) scales. For HTTP, it's great for serving static contents from a cache or from the filesystem, but if that's all you serve, why use Python? Real web apps use intricate combinations of databases, memcache, in-memory cache, and template expansion. The biggest difference you can make there is probably getting rid of the ORM in favor of more direct SQL, and next on the list would be reimplementing template expansion in C. (And heck, you could release the GIL while you're doing that. :-)

Agree with the general sentiment "if that's all you're doing, why use Python?". The async HTTP server should allow other things to be built on top of it such that it's adding value over and above, say, an apache instance serving static files.

...

And in hindsight, perhaps I need to put more emphasis on the fact that it *is* very experimental work with a long-term view, versus Tulip/asyncio, which was intended for *now*. So although Tulip and PyParallel spawned from the same discussions and are attempting to attack the same problem -- it's really not fair for me to discredit Tulip/Twisted in favor of PyParallel because they're on completely different playing fields with vastly different implementation time frames (I'm thinking 5+ years before this work lands in a mainstream Python release -- if it ever does. And if not, hey, it can live on as another interpreter, just like Stackless et al).

I would love it if you could write a list of things a callback *cannot* do when it is in parallel mode. I believe that list includes mutating any kind of global/shared state (any object created in the main thread is read-only in parallel mode -- it seems you had to work hard to make string interning work, which is semantically transparent but mutates hidden global state). In addition (or, more likely, as a consequence!) a callback cannot create anything that lasts beyond the callback's lifetime, except for the brief time between the callback's return and the completion of the I/O operation involving the return value. (Actually, I missed how you do this -- doesn't this mean you cannot release the callback's heap until much later?)

So, I think I already answered that above. The next presentation (PyCon Montreal) will be purely focused on this stuff -- I've been beating the alternate approach to async I/O for long enough ;-)

...

So it seems that the price for extreme concurrency is the same as always -- you can only run purely functional code. Haskell fans won't mind, but for Python this seems to be putting the cart before the horse -- who wants to write Python with those constraints?

Basically it's all still work in progress, but the PyParallel-for- parallel-compute use case is very important. And there's no way that can be done without having a way to return the results of parallel computation back into the next stage of your pipeline where more analysis is done. Getting hired by Continuum is actually great for this use case; we're in the big data, parallel task decomposition space, after all, not the writing-async-socket-server business ;-) I know Peter and Travis are both very supportive of PyParallel so its just a matter of trying to find time to work on it between consultancy engagements. Trent.

Reply

Sign in to reply online Use email software

Antoine Pitrou

10:27 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Sat, 16 Nov 2013 21:24:56 -0500 Trent Nelson <trent@snakebite.org> wrote:

...

I don't think they are attempting to attack the same problem. asyncio and similar frameworks (Twisted, Tornado, etc.) try to solve the issue of I/O concurrency, while you are trying to solve the issue of CPU parallelism (i.e. want Python to actually exploit several CPUs simutaneously: asyncio doesn't really care about that, although it has a primitive to let you communicate with subprocesses). Yes, you can want to "optimize" static data serving by using several CPU cores at once, but that sounds quite pointless except perhaps for a few niche situations (and as Guido says, there are perfectly good off-the-shelf solutions for efficient static data serving). I think most people who'd like the GIL removed are not I/O-bound. Regards Antoine.

Reply

Sign in to reply online Use email software

Trent Nelson

November 2013

2:24 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Sat, Nov 16, 2013 at 05:39:13PM -0800, Guido van Rossum wrote:

...

Trent, I watched your video and read your slides. (Does the word "motormouth" mean anything to you? :-)

Side-effect of both a) not having time to rehearse, and b) trying to compress 153 slides into 45 minutes :-)

...

Finally, I'm not sure why you are so confrontational about the way Twisted and Tulip do things. We are doing things the only way they *can* be done without overhauling the entire CPython implementation (which you have proven will take several major release cycles, probably until 4.0). It's fine that you are looking further forward than most of us. I don't think it makes sense that you are blaming the rest of us for writing libraries that can be used today.

I watched the video today; there's a point where I say something along the lines of "that's not how you should do IOCP; they're doing it wrong". That definitely came out wrong -- when limited to a single-threaded execution model, which today's Python is, then calling GetQueuedCompletionStatus() in a single-threaded event loop is really the only option you have. (I think I also say "that's just as bad as select()"; I didn't mean that either -- it's definitely better than select() when you're limited to the single-threaded execution model. What I was trying to convey was that doing it like that wasn't really how IOCP was designed to be used -- which is why I dig into the intrinsic link between IOCP, async I/O and threading for so many slides.) And in hindsight, perhaps I need to put more emphasis on the fact that it *is* very experimental work with a long-term view, versus Tulip/asyncio, which was intended for *now*. So although Tulip and PyParallel spawned from the same discussions and are attempting to attack the same problem -- it's really not fair for me to discredit Tulip/Twisted in favor of PyParallel because they're on completely different playing fields with vastly different implementation time frames (I'm thinking 5+ years before this work lands in a mainstream Python release -- if it ever does. And if not, hey, it can live on as another interpreter, just like Stackless et al).

...

Clearly your work isn't ready for python-dev -- it is just too speculative. I've moved python-dev to BCC and added python-ideas.

It possibly doesn't even belong on python-ideas -- if you are serious about wanting to change Linux or other *NIX variants, you'll have to go find a venue where people who do forward-looking kernel work hang out.

Yeah this e-mail was more of a final follow up to e-mails I sent to python-ideas last year re: the whole "alternate async approach" thread. (I would have replied to that thread directly, had I kept it in my inbox.) Trent.

Reply

Sign in to reply online Use email software

Andrew Barnert

6:31 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

From: Guido van Rossum <guido@python.org> Sent: Saturday, November 16, 2013 6:56 PM

...

Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy this will be optimal.

But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)

I got the same impression from the presentation. First, I completely agree with the fact that most Unix servers are silly on Windows even in the single-threaded case—simulating epoll on top of single-threaded completion-based GQCS just so you can simulate a completion-based design on top of your simulated ready-based epoll is wasteful and overly complex. But that's a much more minor issue than taking advantage of Windows' integration between threading and async I/O, and one that many server frameworks have already fixed, and that PyParallel isn't necessary for. I also agree that using IOCP for a multi-threaded proactor instead of a single-threaded reactor plus dispatcher is a huge win in the kinds of shared-memory threaded apps that you can't write in CPython. From my experience building a streaming video server and an IRC-esque interactive communications server, using a reactor plus dispatcher on Windows means one core completely wasted, 40% less performance from the others, and much lower scalability; emulating a proactor on Unix on top of a reactor and dispatcher is around a 10% performance cost (plus a bit of extra code complexity). So, a threaded proactor wins, unless you really don't care about Windows. But PyParallel doesn't look like it supports such applications any better than stock CPython. As soon as you need to send data from one client to other clients, you're not in a shared-nothing parallel context anymore. Even less extreme cases than streaming video or chat, where all you need is, e.g., shared caching of dynamically-generated data, I don't see how you'd do that in PyParallel. If you can build a simple multi-user chat server with PyParallel, and show it using all my cores, that would be a lot more compelling.

Reply

Sign in to reply online Use email software

Andrew Barnert

7:41 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Nov 16, 2013, at 22:35, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

I wonder whether an explicit copy_to_main_thread (maybe both shallow and deep variants) function, maybe with a Queue subclass that called it automatically in the put method, would be sufficient for a decent class of applications to be built?

...

Windows is fine at scheduling ncores separate processes; it's just slow at _starting_ each one. And most servers aren't constantly creating and reaping processes; they create ncores processes at startup or when they're first needed. So if it takes 0.9 seconds instead of 0.2 to restart your server, is that a big enough problem to rewrite the whole server (and the interpreter)? As I said in my previous message, the benefit of being able to skip refcounting might make it worth doing. But avoiding process creation overhead isn't much if a win.

Reply

Sign in to reply online Use email software

Antoine Pitrou

November 2013

10:30 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Sun, 17 Nov 2013 16:35:23 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:

...

This is a bit of an oversimplification. The cost of processes is not only the cost of spawning them. There is also the CPU cost of marshalling data between processes, and the memory cost of having duplicate structures and data in your various processes. (also, note that using a process pool generally amortizes the spawning cost quite well) Regards Antoine.

Reply

Sign in to reply online Use email software

Trent Nelson

10:34 p.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

(I saw that there were a number of additional e-mails echo'ing Guido's sentiment/concerns re: shared nothing. I picked this thread to reply to and tried to provide as much info as possible in lieu of replying to everyone individually.) On Sat, Nov 16, 2013 at 06:56:11PM -0800, Guido van Rossum wrote:

...

I wish you had spent more time on explaining how IOCP works and less on judging other approaches.

Heh, it's funny, with previous presentations, I didn't labor the point anywhere near as much, and I found that when presenting to UNIX people, they were very defensive of the status quo. I probably over-compensated a little too much in the opposite direction this time; I don't think anyone is going to argue vehemently that the UNIX status quo is optimal on Windows; but a side-effect is that it unnecessarily slanders existing bodies of work (Twisted et al) that have undeniably improved the overall ecosystem over the past decade.

...

Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy this will be optimal.

The only thing I'd add is that, when speaking in terms of socket servers and whatnot, it helps to visualize Python callbacks as "the bits of logic that need to run before invoking the next asynchronous call". Anything I/O related can be done via an asynchronous call; that's basically the exit point of the processing thread -- it dispatches the async WSARecv() (for example), then moves onto the next request in the I/O completion port's queue. When that WSARecv() returns, we get all the info we need from the completion context to figure out what we just did, and, based on the protocol we provided, what needs to be done next. So, we do a little more pure Python processing and then dispatch the next asynchronous call, which, in this case, might be a WSASend(); the thread then moves onto the next request in the queue. That's all handled by the PxSocket_IOLoop monstrosity: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l62... I got the inspiration for that implementation from CEval_FrameEx; you basically have one big inline method where you can go from anything to anything without having to call additional C functions; thus, doing back-to-back sends, for example, won't exhaust your stack. That allows us to do the dynamic switch between sync and async depending on protocol preference, current client load, number of active I/O hogs, that sort of thing: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l64... PxSocket_IOLoop currently only handles 1:1 TCP/IP connections, which limits its applicability. I want to expand that -- I should be able to connect any sort of end points together in any fashion -- similar to how ZeroMQ allows the bridge/fan-out/router type composition. An endpoint would be anything that allows me to initiate an async operation against it, e.g. file, device, socket, whatever. This is where Windows really shines, because you can literally do everything either synchronously or asynchronously. There should also be support for 1:m and m:n relationships between endpoints (i.e. an IRC chat server). So I see PxSocket_IOLoop turning into a more generic PxThread_Loop that can handle anything-to-anything -- basically, calling the Python code that needs to run before dispatching the next async call. The current implementation also does a lot of live introspection against the protocol object to figure out what to do next; i.e. first entry point for a newly-connected client is here: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l63... At every entry point into the loop, and at every point *after* the relevant Python code has been run, we're relying on the protocol to tell us what to do next in a very hard-coded fashion. I think for PxThread_Loop to become truly dynamic, it should mirror CEval_FrameEx even closer; the protocol analysis should be done separately, the output of which is a stream of async-opcode bytes that direct the main dispatching logic: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l63... dispatch: switch (next_opcode) { TARGET(maybe_shutdown_send_or_recv); TARGET(handle_error); TARGET(connection_made_callback); TARGET(data_received_callback); TARGET(send_complete_callback); TARGET(overlapped_recv_callback); TARGET(post_callback_that_supports_sending_retval); TARGET(post_callback_that_does_not_support_sending_retval); TARGET(close_); TARGET(try_send); default: break; } Then we'd have one big case statement just like with CEval_FrameEx that handles all possible async-opcodes, rather than the goto spaghetti in the current PxSocket_IOLoop. The async opcodes would be generic and platform-independent; i.e. file write, file read, single socket write, multi-socket write, etc. On Windows/Solaris/AIX, everything could be handled asynchronously, on other platforms, you would have to fake it using an event loop + multiplex method, identical to how twisted/tornado/tulip do it currently.

...

But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)

Ok, so, heh, I lied in the presentation. The main thread won't be frozen per-se, and the parallel threads will have a way to share state. I've already done a huge amount of work on this, but it's very involved and that presentation was long enough as it is. Also, it's easier to understand why reference counting and GC isn't needed in parallel contexts if you just assume the main thread isn't running. In reality, one of the first things I had to figure out was how these parallel contexts could communicate state back to the main thread -- because without this ability, how the heck would you propagate an exception raised in a parallel thread back to the main thread? The exception will be backed my memory allocated in the parallel context -- that can't be free'd until the exception has been dealt with and no references to it remain. As that was one of the first problems I had to solve, it has one of the hackiest implementations :-) The main thread's async.run_once() implementation can detect which parallel threads raised exceptions (because they've done an interlocked push to the main thread's error list) and it will extend the lifetime of the context for an additional number of subsequent runs of run_once(). Once the TTL of the context drops to 0, it is finally released. The reason it's hacky is because there's no direct correlation between the exception object finally having no references to it and the point we destroy the context. If you persisted the exception object to a list in the main thread somewhere, you'd segfault down the track when trying to access that memory. So, on the second iteration, I came up with some new concepts; context persistence and object promotion. A main-thread list or dict could be async protected such that this would work: # main thread d1 = {} l1 = [] async.protect(d1) async.protect(l1) # this would also work d2 = async.dict() l2 = async.list() # fyi: async.rdtsc() returns a PyLong wrapped # version of the CPU TSC; handy for generating # non-interned objects allocated from a parallel # context def callback(name): d1[name] = async.rdtsc() l1.append(async.rdtsc()) async.submit_work(callback, 'foo') async.submit_work(callback, 'bar') async.submit_work(callback, 'moo') async.run() That actually works; the async.protect() call intercepts the object's tp_as_mapping and tp_as_sequence fields and redirects them to thread-safe versions that use read/write locks. It also toggles a persistence bit on both the parallel context and the parallel long object, such that reference counting *is* actually enabled on it once it's back in the main thread -- when the ref count drops to 0, we check to see if it's an object that's been persisted, and if so, we decref the original context -- when the context's refcount gets to 0, only *then* do we free it. (I also did some stuff where you could promote simple objects where it made sense -- i.e. there's no need to keep a 4k context around if the end result was a scalar that could be represented in 50-200 bytes; just memcpy it from the main thread ("promote it to a main thread object with reference counting") and free the context.) You can see some examples of the different type of stuff you can do here: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/test/test_pri... The problem though was that none of my unit tests assigned more than ten items to a list/dict, so I never encountered a resize :-) You can imagine what happens when a resize takes place within a parallel context -- the list/dict is realloc'd using the parallel context heap allocator -- that's not ideal, it's a main thread object, it shouldn't be reallocated with temporary parallel thread memory. I think that was the point where I went "oh, bollocks!" and switched over to tackling the async socket stuff. However, the async socket work forced me to implement all sorts of new concepts, including the heap snapshots and TLS heap overrides (for interned strings). Pair that with the page locking stuff and I have a much richer set of tools at my disposal to solve that problem -- I just need to completely overhaul everything memory related now that I know how it needs to be implemented :-) As for the dict/list assignment/resize, the key problem is figuring out whether a PyObject_Realloc call is taking place because we're resizing a main thread container object -- that's not an easy thing to figure out -- all you have is a pointer at the time you need to make the decision. That's where the memory refactoring work comes in -- I'm still working on the details, but the general idea is that you'll be able to do very efficient pointer address tests against known base address masks to figure out the origin of the object and how the current memory request needs to be satisfied. The other option I played around with was an interlocked list type that is exposed directly to Python: x = async.xlist() def callback(): x.push(async.rdtsc()) for _ in xrange(0, 10): async.submit_work(callback) async.run() # interlocked flush of all results into a list. results = x.flush() The key difference between an interlocked list and a normal list is that an interlocked list has its very own localized heap, just like parallel contexts have; pushing a scalar onto the list automatically "promotes it". That is, the object is memcpy'd directly using the xlist's heap, and we can keep that heap alive independently to the parallel contexts that pushed objects onto it. I was also planning on using this as a waitable queue, so you could compose pipelines of producers/consumers and that sort of thing. Then I ran out of time :-)

...

I don't really care how well CHARGEN (I had to look it up) scales. For HTTP, it's great for serving static contents from a cache or from the filesystem, but if that's all you serve, why use Python? Real web apps use intricate combinations of databases, memcache, in-memory cache, and template expansion. The biggest difference you can make there is probably getting rid of the ORM in favor of more direct SQL, and next on the list would be reimplementing template expansion in C. (And heck, you could release the GIL while you're doing that. :-)

Agree with the general sentiment "if that's all you're doing, why use Python?". The async HTTP server should allow other things to be built on top of it such that it's adding value over and above, say, an apache instance serving static files.

...

And in hindsight, perhaps I need to put more emphasis on the fact that it *is* very experimental work with a long-term view, versus Tulip/asyncio, which was intended for *now*. So although Tulip and PyParallel spawned from the same discussions and are attempting to attack the same problem -- it's really not fair for me to discredit Tulip/Twisted in favor of PyParallel because they're on completely different playing fields with vastly different implementation time frames (I'm thinking 5+ years before this work lands in a mainstream Python release -- if it ever does. And if not, hey, it can live on as another interpreter, just like Stackless et al).

I would love it if you could write a list of things a callback *cannot* do when it is in parallel mode. I believe that list includes mutating any kind of global/shared state (any object created in the main thread is read-only in parallel mode -- it seems you had to work hard to make string interning work, which is semantically transparent but mutates hidden global state). In addition (or, more likely, as a consequence!) a callback cannot create anything that lasts beyond the callback's lifetime, except for the brief time between the callback's return and the completion of the I/O operation involving the return value. (Actually, I missed how you do this -- doesn't this mean you cannot release the callback's heap until much later?)

So, I think I already answered that above. The next presentation (PyCon Montreal) will be purely focused on this stuff -- I've been beating the alternate approach to async I/O for long enough ;-)

...

So it seems that the price for extreme concurrency is the same as always -- you can only run purely functional code. Haskell fans won't mind, but for Python this seems to be putting the cart before the horse -- who wants to write Python with those constraints?

Basically it's all still work in progress, but the PyParallel-for- parallel-compute use case is very important. And there's no way that can be done without having a way to return the results of parallel computation back into the next stage of your pipeline where more analysis is done. Getting hired by Continuum is actually great for this use case; we're in the big data, parallel task decomposition space, after all, not the writing-async-socket-server business ;-) I know Peter and Travis are both very supportive of PyParallel so its just a matter of trying to find time to work on it between consultancy engagements. Trent.

Reply

Sign in to reply online Use email software

Antoine Pitrou

10:27 a.m.

New subject: [Python-Dev] PyParallel: alternate async I/O and GIL removal

On Sat, 16 Nov 2013 21:24:56 -0500 Trent Nelson <trent@snakebite.org> wrote:

...

I don't think they are attempting to attack the same problem. asyncio and similar frameworks (Twisted, Tornado, etc.) try to solve the issue of I/O concurrency, while you are trying to solve the issue of CPU parallelism (i.e. want Python to actually exploit several CPUs simutaneously: asyncio doesn't really care about that, although it has a primitive to let you communicate with subprocesses). Yes, you can want to "optimize" static data serving by using several CPU cores at once, but that sounds quite pointless except perhaps for a few niche situations (and as Guido says, there are perfectly good off-the-shelf solutions for efficient static data serving). I think most people who'd like the GIL removed are not I/O-bound. Regards Antoine.

Reply

Sign in to reply online Use email software

Re: [Python-ideas] [Python-Dev] PyParallel: alternate async I/O and GIL removal

Guido van Rossum

Trent Nelson

Guido van Rossum

Andrew Barnert

Nick Coghlan

Andrew Barnert

Charles-François Natali

Antoine Pitrou

Andrew Barnert

Trent Nelson

Nick Coghlan

Antoine Pitrou

Antoine Pitrou

Trent Nelson

Guido van Rossum

Andrew Barnert

Nick Coghlan

Andrew Barnert

Charles-François Natali

Antoine Pitrou

Andrew Barnert

Trent Nelson

Nick Coghlan

Antoine Pitrou

Antoine Pitrou

tags

participants (6)