Re: [Python-ideas] [Python-Dev] PyParallel: alternate async I/O and GIL removal

Trent, I watched your video and read your slides. (Does the word "motormouth" mean anything to you? :-) Clearly your work isn't ready for python-dev -- it is just too speculative. I've moved python-dev to BCC and added python-ideas. It possibly doesn't even belong on python-ideas -- if you are serious about wanting to change Linux or other *NIX variants, you'll have to go find a venue where people who do forward-looking kernel work hang out.

Finally, I'm not sure why you are so confrontational about the way Twisted and Tulip do things. We are doing things the only way they *can* be done without overhauling the entire CPython implementation (which you have proven will take several major release cycles, probably until 4.0). It's fine that you are looking further forward than most of us. I don't think it makes sense that you are blaming the rest of us for writing libraries that can be used today.

--
--Guido van Rossum (python.org/~guido)

On Sat, Nov 16, 2013 at 05:39:13PM -0800, Guido van Rossum wrote:
Trent, I watched your video and read your slides. (Does the word "motormouth" mean anything to you? :-)
Side-effect of both a) not having time to rehearse, and b) trying to compress 153 slides into 45 minutes :-)
I watched the video today; there's a point where I say something along the lines of "that's not how you should do IOCP; they're doing it wrong". That definitely came out wrong -- when limited to a single-threaded execution model, which today's Python is, calling GetQueuedCompletionStatus() in a single-threaded event loop is really the only option you have. (I think I also say "that's just as bad as select()"; I didn't mean that either -- it's definitely better than select() when you're limited to the single-threaded execution model. What I was trying to convey was that doing it like that isn't really how IOCP was designed to be used -- which is why I dig into the intrinsic link between IOCP, async I/O and threading for so many slides.)

And in hindsight, perhaps I need to put more emphasis on the fact that it *is* very experimental work with a long-term view, versus Tulip/asyncio, which was intended for *now*. So although Tulip and PyParallel spawned from the same discussions and are attempting to attack the same problem, it's really not fair for me to discredit Tulip/Twisted in favor of PyParallel: they're on completely different playing fields with vastly different implementation time frames (I'm thinking 5+ years before this work lands in a mainstream Python release -- if it ever does. And if not, hey, it can live on as another interpreter, just like Stackless et al).
Yeah, this e-mail was more of a final follow-up to the e-mails I sent to python-ideas last year re: the whole "alternate async approach" thread. (I would have replied to that thread directly, had I kept it in my inbox.)

Trent.

On Sat, Nov 16, 2013 at 6:24 PM, Trent Nelson <trent@snakebite.org> wrote:
On Sat, Nov 16, 2013 at 05:39:13PM -0800, Guido van Rossum wrote:
[snip]
I wish you had spent more time on explaining how IOCP works and less on judging other approaches.

Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy, this will be optimal.

But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)

I don't really care how well CHARGEN (I had to look it up) scales. For HTTP, it's great for serving static contents from a cache or from the filesystem, but if that's all you serve, why use Python? Real web apps use intricate combinations of databases, memcache, in-memory cache, and template expansion. The biggest difference you can make there is probably getting rid of the ORM in favor of more direct SQL, and next on the list would be reimplementing template expansion in C. (And heck, you could release the GIL while you're doing that. :-)
I would love it if you could write a list of things a callback *cannot* do when it is in parallel mode. I believe that list includes mutating any kind of global/shared state (any object created in the main thread is read-only in parallel mode -- it seems you had to work hard to make string interning work, which is semantically transparent but mutates hidden global state).

In addition (or, more likely, as a consequence!) a callback cannot create anything that lasts beyond the callback's lifetime, except for the brief time between the callback's return and the completion of the I/O operation involving the return value. (Actually, I missed how you do this -- doesn't this mean you cannot release the callback's heap until much later?)

So it seems that the price for extreme concurrency is the same as always -- you can only run purely functional code. Haskell fans won't mind, but for Python this seems to be putting the cart before the horse -- who wants to write Python with those constraints?

[snip]

--
--Guido van Rossum (python.org/~guido)

From: Guido van Rossum <guido@python.org> Sent: Saturday, November 16, 2013 6:56 PM
Summarizing my understanding of what you're saying, it seems the "right" way to use IOCP on a multi-core machine is to have one thread per core (barring threads you need for unavoidably blocking stuff) and to let the kernel schedule callbacks on all those threads. As long as the callbacks don't block and events come in at a rate to keep all those cores busy this will be optimal.
But this is almost tautological. It only works if the threads don't communicate with each other or with the main thread (all shared data must be read-only). But heh, if that's all, one process per core works just as well. :-)
I got the same impression from the presentation.

First, I completely agree that most Unix-first server designs do something silly on Windows, even in the single-threaded case -- simulating a ready-based epoll on top of the completion-based GetQueuedCompletionStatus, just so you can then simulate a completion-based design on top of that simulated epoll, is wasteful and overly complex. But that's a much more minor issue than taking advantage of Windows' integration between threading and async I/O, it's one that many server frameworks have already fixed, and PyParallel isn't necessary for it.

I also agree that using IOCP for a multi-threaded proactor instead of a single-threaded reactor plus dispatcher is a huge win in the kinds of shared-memory threaded apps that you can't write in CPython. In my experience building a streaming video server and an IRC-esque interactive communications server, using a reactor plus dispatcher on Windows means one core completely wasted, 40% less performance from the others, and much lower scalability; emulating a proactor on Unix on top of a reactor and dispatcher costs around 10% in performance (plus a bit of extra code complexity). So a threaded proactor wins, unless you really don't care about Windows.

But PyParallel doesn't look like it supports such applications any better than stock CPython. As soon as you need to send data from one client to other clients, you're not in a shared-nothing parallel context anymore. Even in less extreme cases than streaming video or chat -- where all you need is, say, shared caching of dynamically generated data -- I don't see how you'd do that in PyParallel. If you can build a simple multi-user chat server with PyParallel, and show it using all my cores, that would be a lot more compelling.
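Concretely, the part I don't see how to write in a shared-nothing callback is the little bit of state every chat server needs -- something like this (the names are purely illustrative, not any real framework's API):

    # Purely illustrative callbacks -- the point is that every one of them
    # has to read *and mutate* state shared by all connections.
    clients = {}   # transport -> nickname, shared across all worker threads

    def connection_made(transport):
        clients[transport] = 'anonymous'          # mutates shared state

    def data_received(transport, data):
        sender = clients[transport]               # reads shared state
        for other in clients:                     # iterates shared state...
            if other is not transport:
                other.send(sender.encode() + b': ' + data)   # ...and fans out

    def connection_lost(transport):
        del clients[transport]                    # mutates shared state again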

On 17 November 2013 12:56, Guido van Rossum <guido@python.org> wrote:
MapReduce fans already do :) I think there's some interesting potential in Trent's PyParallel work, but it needs something analogous to Rust's ability to transfer object ownership between threads (thus enabling message passing) to expand beyond the simple worker thread model which is really only interesting on Windows (where processes are expensive - on *nix, processes are generally cheap enough that PyParallel is unlikely to be worth the hassle). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Nov 16, 2013, at 22:35, Nick Coghlan <ncoghlan@gmail.com> wrote:
I wonder whether an explicit copy_to_main_thread function (maybe with both shallow and deep variants), perhaps combined with a Queue subclass that calls it automatically in its put method, would be sufficient for a decent class of applications to be built?
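Something like the following is what I have in mind -- every name here is made up, and the deep copy is just a stand-in for whatever promotion primitive PyParallel would actually expose:

    import copy
    import queue

    def copy_to_main_thread(obj, deep=True):
        # Hypothetical promotion primitive; a plain (deep) copy stands in
        # for whatever PyParallel would actually do to move the object
        # into ordinary refcounted main-thread memory.
        return copy.deepcopy(obj) if deep else copy.copy(obj)

    class PromotingQueue(queue.Queue):
        """Queue that promotes items to main-thread memory on put()."""
        def put(self, item, block=True, timeout=None):
            super().put(copy_to_main_thread(item, deep=True), block, timeout)

A parallel callback could then put() results onto such a queue for the main thread to pick up later.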
Windows is fine at scheduling ncores separate processes; it's just slow at _starting_ each one. And most servers aren't constantly creating and reaping processes; they create ncores processes at startup or when they're first needed. So if it takes 0.9 seconds instead of 0.2 to restart your server, is that a big enough problem to rewrite the whole server (and the interpreter)? As I said in my previous message, the benefit of being able to skip refcounting might make it worth doing. But avoiding process creation overhead isn't much of a win.
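For example, the usual stdlib pattern pays the process-creation cost exactly once, at startup; the handler below is just a toy stand-in for real per-request work:

    import multiprocessing

    def handle(request):
        return request.upper()    # stand-in for real per-request work

    if __name__ == '__main__':
        # Pay the process-creation cost once, up front...
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        # ...then every "request" only pays the IPC cost, not a spawn.
        results = [pool.apply_async(handle, (req,)) for req in ('foo', 'bar', 'baz')]
        print([r.get() for r in results])
        pool.close()
        pool.join()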

There's something which really bothers me (well, that has been bothering me since the beginning):

"""
Memory Deallocation within Parallel Contexts

- These parallel contexts aren't intended to be long-running bits of code/algorithm
- Let's not free() anything...
- ...and just blow away the entire heap via HeapFree() with one call, once the context has finished
- Cons:
  - You technically couldn't do this:

        def work():
            for x in xrange(0, 1000000000):
                ...

  - (Why would you!)
- So... there's no point reference counting objects allocated within parallel contexts!
"""

So basically, PyParallel solves the issue of garbage collection in a multi-threaded process by not doing garbage collection. Yeah, sure, things get a lot simpler, but in real life you do want to have loops such as the one above; I don't see how one could pretend otherwise. That's simply a show-stopper to me.

In fact, I find the whole programming model completely puzzling. Depending on whether you're in the main thread or not:

- you're only able to write to thread-local data (thread-local in the sense of allocated by the current thread, not thread-specific): what happens if some parallel context calls "import foo"?
- you won't be able to allocate/free many objects

So, in fact, in your parallel contexts you can do so little that it's IMO almost useless in practice. It's not Python -- more like a cross between Haskell and Python -- and moreover, it means that some code cannot be executed in a parallel context. Which means that you basically cannot use any library, since you don't know what it's doing under the hood (it might die with a MemoryError or an invalid write to main thread memory).

Compare this to e.g. Go's goroutines and channels, and you'll see how one might solve those issues in a sensible way (including user-level thread multiplexing over kernel-level threads, and using epoll ;-).

In short, I'm really skeptical, to say the least...

cf

On Sun, 17 Nov 2013 16:35:23 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
This is a bit of an oversimplification. The cost of processes is not only the cost of spawning them. There is also the CPU cost of marshalling data between processes, and the memory cost of having duplicate structures and data in your various processes. (also, note that using a process pool generally amortizes the spawning cost quite well) Regards Antoine.

On Nov 17, 2013, at 2:30, Antoine Pitrou <solipsis@pitrou.net> wrote:
But PyParallel doesn't seem to provide _any_ way to pass data between threads. So, the fact that multiprocessing provides only a slow way to pass data between processes can't be considered a weakness. Any program that could be written in PyParallel can't see those costs.

(I saw that there were a number of additional e-mails echoing Guido's sentiment/concerns re: shared nothing. I picked this thread to reply to and tried to provide as much info as possible, in lieu of replying to everyone individually.)

On Sat, Nov 16, 2013 at 06:56:11PM -0800, Guido van Rossum wrote:
I wish you had spent more time on explaining how IOCP works and less on judging other approaches.
Heh, it's funny: with previous presentations, I didn't labor the point anywhere near as much, and I found that when presenting to UNIX people, they were very defensive of the status quo. I probably over-compensated in the opposite direction this time. I don't think anyone is going to argue vehemently that the UNIX status quo is optimal on Windows; but a side-effect of that framing is that it unnecessarily slanders existing bodies of work (Twisted et al) that have undeniably improved the overall ecosystem over the past decade.
The only thing I'd add is that, when speaking in terms of socket servers and whatnot, it helps to visualize Python callbacks as "the bits of logic that need to run before invoking the next asynchronous call". Anything I/O related can be done via an asynchronous call; that's basically the exit point of the processing thread -- it dispatches the async WSARecv() (for example), then moves on to the next request in the I/O completion port's queue. When that WSARecv() returns, we get all the info we need from the completion context to figure out what we just did and, based on the protocol we provided, what needs to be done next. So, we do a little more pure-Python processing and then dispatch the next asynchronous call -- which, in this case, might be a WSASend(); the thread then moves on to the next request in the queue.

That's all handled by the PxSocket_IOLoop monstrosity: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l62...

I got the inspiration for that implementation from CEval_FrameEx: you basically have one big inline method where you can go from anything to anything without having to call additional C functions; thus, doing back-to-back sends, for example, won't exhaust your stack. That allows us to do the dynamic switch between sync and async depending on protocol preference, current client load, number of active I/O hogs, that sort of thing: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l64...

PxSocket_IOLoop currently only handles 1:1 TCP/IP connections, which limits its applicability. I want to expand that -- I should be able to connect any sort of endpoints together in any fashion, similar to how ZeroMQ allows bridge/fan-out/router composition. An endpoint would be anything that allows me to initiate an async operation against it, e.g. a file, device, socket, whatever. This is where Windows really shines, because you can literally do everything either synchronously or asynchronously. There should also be support for 1:m and m:n relationships between endpoints (e.g. an IRC chat server).

So I see PxSocket_IOLoop turning into a more generic PxThread_Loop that can handle anything-to-anything -- basically, calling the Python code that needs to run before dispatching the next async call. The current implementation also does a lot of live introspection against the protocol object to figure out what to do next; the first entry point for a newly-connected client, for example, is here: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l63...

At every entry point into the loop, and at every point *after* the relevant Python code has been run, we're relying on the protocol to tell us what to do next, in a very hard-coded fashion. I think for PxThread_Loop to become truly dynamic, it should mirror CEval_FrameEx even more closely: the protocol analysis should be done separately, and its output would be a stream of async-opcode bytes that direct the main dispatching logic: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l63...
    dispatch:
        switch (next_opcode) {
            TARGET(maybe_shutdown_send_or_recv);
            TARGET(handle_error);
            TARGET(connection_made_callback);
            TARGET(data_received_callback);
            TARGET(send_complete_callback);
            TARGET(overlapped_recv_callback);
            TARGET(post_callback_that_supports_sending_retval);
            TARGET(post_callback_that_does_not_support_sending_retval);
            TARGET(close_);
            TARGET(try_send);
            default:
                break;
        }

Then we'd have one big case statement just like with CEval_FrameEx that handles all possible async-opcodes, rather than the goto spaghetti in the current PxSocket_IOLoop. The async opcodes would be generic and platform-independent; i.e. file write, file read, single socket write, multi-socket write, etc. On Windows/Solaris/AIX, everything could be handled asynchronously; on other platforms, you would have to fake it using an event loop + multiplex method, identical to how twisted/tornado/tulip do it currently.
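The protocol-analysis side might then boil down to something like this rough Python sketch (the opcode names and helper are purely illustrative -- none of this exists yet):

    # Rough sketch only: inspect the protocol once, up front, and emit the
    # async-opcode stream the dispatch loop consumes, instead of
    # re-inspecting the protocol object at every entry point.
    (OP_CONNECTION_MADE, OP_RECV, OP_DATA_RECEIVED,
     OP_SEND_COMPLETE, OP_TRY_SEND) = range(5)

    def compile_protocol(protocol):
        ops = []
        if hasattr(protocol, 'connection_made'):
            ops.append(OP_CONNECTION_MADE)
        ops.extend([OP_RECV, OP_DATA_RECEIVED])
        if hasattr(protocol, 'send_complete'):
            ops.append(OP_SEND_COMPLETE)
        ops.append(OP_TRY_SEND)
        return bytes(ops)

    class Echo:
        def connection_made(self, transport): pass
        def data_received(self, transport, data): return data

    print(list(compile_protocol(Echo())))   # -> [0, 1, 2, 4]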
Ok, so, heh, I lied in the presentation. The main thread won't be frozen per se, and the parallel threads will have a way to share state. I've already done a huge amount of work on this, but it's very involved, and that presentation was long enough as it is. Also, it's easier to understand why reference counting and GC aren't needed in parallel contexts if you just assume the main thread isn't running.

In reality, one of the first things I had to figure out was how these parallel contexts could communicate state back to the main thread -- because without this ability, how the heck would you propagate an exception raised in a parallel thread back to the main thread? The exception will be backed by memory allocated in the parallel context -- that can't be freed until the exception has been dealt with and no references to it remain.

As that was one of the first problems I had to solve, it has one of the hackiest implementations :-) The main thread's async.run_once() implementation can detect which parallel threads raised exceptions (because they've done an interlocked push to the main thread's error list) and it will extend the lifetime of the context for an additional number of subsequent runs of run_once(). Once the TTL of the context drops to 0, it is finally released. The reason it's hacky is that there's no direct correlation between the exception object finally having no references to it and the point where we destroy the context. If you persisted the exception object to a list in the main thread somewhere, you'd segfault down the track when trying to access that memory.

So, on the second iteration, I came up with some new concepts: context persistence and object promotion. A main-thread list or dict could be async-protected such that this would work:

    # main thread
    d1 = {}
    l1 = []
    async.protect(d1)
    async.protect(l1)

    # this would also work
    d2 = async.dict()
    l2 = async.list()

    # fyi: async.rdtsc() returns a PyLong-wrapped
    # version of the CPU TSC; handy for generating
    # non-interned objects allocated from a parallel
    # context
    def callback(name):
        d1[name] = async.rdtsc()
        l1.append(async.rdtsc())

    async.submit_work(callback, 'foo')
    async.submit_work(callback, 'bar')
    async.submit_work(callback, 'moo')
    async.run()

That actually works; the async.protect() call intercepts the object's tp_as_mapping and tp_as_sequence fields and redirects them to thread-safe versions that use read/write locks. It also toggles a persistence bit on both the parallel context and the parallel long object, such that reference counting *is* actually enabled on the object once it's back in the main thread -- when the refcount drops to 0, we check to see if it's an object that's been persisted, and if so, we decref the original context; when the context's refcount gets to 0, only *then* do we free it.

(I also did some stuff where you could promote simple objects where it made sense -- i.e. there's no need to keep a 4k context around if the end result is a scalar that can be represented in 50-200 bytes; just memcpy it from the main thread ("promote it to a main thread object with reference counting") and free the context.)

You can see some examples of the different types of stuff you can do here: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/test/test_pri...
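And once async.run() returns in the example above, the main thread should just see ordinary, refcounted objects; i.e. something like this ought to hold:

    # (continuing the example above)
    assert set(d1) == {'foo', 'bar', 'moo'}
    assert len(l1) == 3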
The problem, though, was that none of my unit tests assigned more than ten items to a list/dict, so I never encountered a resize :-) You can imagine what happens when a resize takes place within a parallel context -- the list/dict is realloc'd using the parallel context's heap allocator. That's not ideal: it's a main thread object, and it shouldn't be reallocated with temporary parallel-thread memory. I think that was the point where I went "oh, bollocks!" and switched over to tackling the async socket stuff.

However, the async socket work forced me to implement all sorts of new concepts, including the heap snapshots and TLS heap overrides (for interned strings). Pair that with the page locking stuff and I have a much richer set of tools at my disposal to solve that problem -- I just need to completely overhaul everything memory related now that I know how it needs to be implemented :-)

As for the dict/list assignment/resize, the key problem is figuring out whether a PyObject_Realloc call is taking place because we're resizing a main thread container object -- that's not an easy thing to figure out; all you have is a pointer at the time you need to make the decision. That's where the memory refactoring work comes in -- I'm still working on the details, but the general idea is that you'll be able to do very efficient pointer address tests against known base address masks to figure out the origin of the object and how the current memory request needs to be satisfied.

The other option I played around with was an interlocked list type that is exposed directly to Python:

    x = async.xlist()

    def callback():
        x.push(async.rdtsc())

    for _ in xrange(0, 10):
        async.submit_work(callback)
    async.run()

    # interlocked flush of all results into a list.
    results = x.flush()

The key difference between an interlocked list and a normal list is that an interlocked list has its very own localized heap, just like parallel contexts have; pushing a scalar onto the list automatically "promotes" it. That is, the object is memcpy'd directly using the xlist's heap, and we can keep that heap alive independently of the parallel contexts that pushed objects onto it. I was also planning on using this as a waitable queue, so you could compose pipelines of producers/consumers and that sort of thing. Then I ran out of time :-)
Agree with the general sentiment "if that's all you're doing, why use Python?". The async HTTP server should allow other things to be built on top of it, such that it adds value over and above, say, an Apache instance serving static files.
So, I think I already answered that above. The next presentation (PyCon Montreal) will be purely focused on this stuff -- I've been beating the alternate-approach-to-async-I/O drum for long enough ;-)
Basically it's all still a work in progress, but the PyParallel-for-parallel-compute use case is very important. And there's no way that can be done without having a way to return the results of parallel computation back into the next stage of your pipeline, where more analysis is done. Getting hired by Continuum is actually great for this use case; we're in the big data, parallel task decomposition space, after all, not the writing-async-socket-server business ;-) I know Peter and Travis are both very supportive of PyParallel, so it's just a matter of trying to find time to work on it between consultancy engagements.

Trent.

On 18 Nov 2013 08:35, "Trent Nelson" <trent@snakebite.org> wrote:
Sweet, this is basically the Rust memory model, which is the direction I'd hoped you would end up going with this (hence why I was asking if you had looked into the details of Rust at the PyCon language summit). For anyone that hasn't looked at Rust, all variables are thread local by default. There are then two mechanisms for sharing with other threads: ownership transfer and promotion to the shared heap. All of this is baked into the compiler, so things like trying to access an object after sending it to another thread trigger a compile error. PyParallel has the additional complication of remaining compatible with standard code that assumes shared memory by default when running in serial mode, but that appears to be a manageable problem. Cheers, Nick.

On Sat, 16 Nov 2013 21:24:56 -0500 Trent Nelson <trent@snakebite.org> wrote:
I don't think they are attempting to attack the same problem. asyncio and similar frameworks (Twisted, Tornado, etc.) try to solve the issue of I/O concurrency, while you are trying to solve the issue of CPU parallelism (i.e. you want Python to actually exploit several CPUs simultaneously; asyncio doesn't really care about that, although it has a primitive to let you communicate with subprocesses). Yes, you can want to "optimize" static data serving by using several CPU cores at once, but that sounds quite pointless except perhaps for a few niche situations (and, as Guido says, there are perfectly good off-the-shelf solutions for efficient static data serving). I think most people who'd like the GIL removed are not I/O-bound.

Regards

Antoine.

On Sat, 16 Nov 2013 21:24:56 -0500 Trent Nelson <trent@snakebite.org> wrote:
I've just read the slides. You've done rather weird and audacious things. That was a very interesting read, thank you! Regards Antoine.

participants (6)
- Andrew Barnert
- Antoine Pitrou
- Charles-François Natali
- Guido van Rossum
- Nick Coghlan
- Trent Nelson