The future of Python parallelism. The GIL. Subinterpreters. Actors.

In the past I have personally viewed Python as difficult to use for parallel applications, which need to do multiple things simultaneously for increased performance:

* The old Threads, Locks, & Shared State model is inefficient in Python due to the GIL, which limits CPU usage to only one thread at a time (ignoring certain functions implemented in C, such as I/O).

* The Actor model can be used with some effort via the "multiprocessing" module, but it doesn't seem that streamlined and forces there to be a separate OS process per line of execution, which is relatively expensive.

I was thinking it would be nice if there was a better way to implement the Actor model, with multiple lines of execution in the same process, yet avoiding contention from the GIL. This implies a separate GIL for each line of execution (to eliminate contention) and a controlled way to exchange data between different lines of execution. So I was thinking of proposing a design for implementing such a system, or at least getting interested parties thinking about such a system.

With some additional research I notice that [PEP 554] ("Multiple subinterpreters in the stdlib") appears to be putting forward a design similar to the one I described. I notice, however, that it mentions that subinterpreters currently share the GIL, which would seem to make them unusable for parallel scenarios due to GIL contention.

I'd like to solicit some feedback on what might be the most efficient way to make forward progress on efficient parallelization in Python inside the same OS process. The most promising areas appear to be:

1. Make the current subinterpreter implementation in Python have more complete isolation, sharing almost no state between subinterpreters; in particular, not sharing the GIL. The "Interpreter Isolation" section of PEP 554 enumerates areas that are currently shared, some of which probably shouldn't be.

2. Give up on making things work inside the same OS process and instead focus on implementing better abstractions on top of the existing multiprocessing API so that the actor model is easier to program against: for example, providing some notion of Channels to communicate between lines of execution, a way to monitor the number of Messages waiting in each channel for throughput profiling and diagnostics, Supervision, and so on. In particular I could do this by using an existing library like Pykka or Thespian and extending it where necessary. (A minimal sketch of this style follows below.)

Thoughts?

[PEP 554]: https://www.python.org/dev/peps/pep-0554/

-- David Foster | Seattle, WA, USA
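[Editorial sketch] To make option 2 concrete, here is a minimal sketch of the actor style that multiprocessing already supports today, with a Queue serving as the actor's inbox. This is illustrative only; a real library like Pykka or Thespian layers addressing and supervision on top of something like this.

    import multiprocessing as mp

    def actor(inbox):
        # Each actor is a separate OS process with a private inbox;
        # the only sharing is via explicit messages.
        for msg in iter(inbox.get, None):  # None is a shutdown sentinel
            print("processed:", msg)

    if __name__ == "__main__":
        inbox = mp.Queue()
        proc = mp.Process(target=actor, args=(inbox,))
        proc.start()
        inbox.put("hello")
        inbox.put(None)   # ask the actor to exit
        proc.join()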

On 9 July 2018 at 04:27, David Foster <davidfstr@gmail.com> wrote:
Yep, that's basically the way Eric and I and a few others have been thinking. Eric started off this year's language summit with a presentation on the topic: https://lwn.net/Articles/754162/

The intent behind PEP 554 is to eventually get to a point where each subinterpreter has its own dedicated eval loop lock, and the GIL either disappears entirely (replaced by smaller purpose-specific locks) or becomes a read/write lock (where write access is only needed to adjust certain state that is shared across subinterpreters).

On the multiprocessing front, it could be quite interesting to attempt to adapt the channel API from PEP 554 to the https://docs.python.org/3/library/multiprocessing.html#module-multiprocessin... data sharing capabilities in the modern multiprocessing module.

Also of relevance is Antoine Pitrou's work on a new version of the pickle protocol that allows for out-of-band data sharing to avoid redundant memory copies: https://www.python.org/dev/peps/pep-0574/

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
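[Editorial sketch] For readers who haven't read the PEP, the 2018 draft sketched an interface roughly like the following. Note the "interpreters" module was a proposal only, not importable in any released Python at the time, and its details were still in flux.

    import interpreters  # proposed stdlib module from the PEP 554 draft

    interp = interpreters.create()   # a new subinterpreter in this process
    interp.run("print('hello from a subinterpreter')")

    # CSP-style channel endpoints; the draft limited what can be sent
    # to simple, effectively-immutable objects (bytes, str, int, ...).
    recv, send = interpreters.create_channel()
    send.send(b"ping")
    print(recv.recv())  # b'ping'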

On Sun, Jul 08, 2018 at 11:27:08AM -0700, David Foster wrote:
You might find PyParallel interesting, at least from a "here's what was tried, it worked, but we're not doing it like that" perspective. http://pyparallel.org https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite... I still think it was a pretty successful proof-of-concept regarding removing the GIL without having to actually remove it. Performance was pretty good too, as you can see in those graphs.
-- David Foster | Seattle, WA, USA
Regards, Trent. -- https://trent.me

I was not aware of PyParallel. The PyParallel "parallel thread" line-of-execution implementation is pretty interesting. Trent, big kudos to you on that effort.

Since you're speaking in the past tense and said "but we're not doing it like that", I infer that the notion of a parallel thread was turned down for integration into CPython, as that appears to have been the original goal. However, I am unable to locate a rationale for why that integration was turned down. Was it deemed too complex to execute, perhaps in the context of providing C extension compatibility? Was there a desire to see a similar implementation on Linux as well as Windows? Some other reason?

Since I presume you were directly involved in the discussions, perhaps you have a link to the relevant thread handy? The last update I see from you re: PyParallel on this list is: https://mail.python.org/pipermail/python-ideas/2015-September/035725.html

David Foster | Seattle, WA, USA On 7/9/18 9:17 AM, Trent Nelson wrote:

On 7/10/2018 10:31 AM, David Foster wrote:
As far as I remember, there was never a formal proposal (PEP), and I just searched PEP 0 for 'parallel'. Hence, no formal rejection, rationale, or thread.
As always, there may have been private, off-the-record, informal discussions. -- Terry Jan Reedy

On Tue, Jul 10, 2018 at 8:32 AM David Foster <davidfstr@gmail.com> wrote:
+1 It's a neat project. Trent's pretty smart. :)
Trent can correct me if I'm wrong, but I believe it boiled down to challenges with the POSIX implementation (that email thread implies this as well), likely coupled with limited time for Trent to work on it. -eric

On 11 July 2018 at 00:31, David Foster <davidfstr@gmail.com> wrote:
It was never extended beyond Windows, and a Windows-only solution doesn't meet the needs of a lot of folks interested in more efficient exploitation of multiple local CPU cores. It's still an interesting design concept though, especially for problems that can be deconstructed into a setup phase (read/write main thread), and a parallel operation phase (ephemeral worker threads that store all persistent state in memory mapped files, or otherwise outside the current process). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

No one has talked about this yet, but modern CPUs with multiple NUMA nodes are atrocious for any shared-memory workload (maybe Threadripper is better, but multiple-socket Xeon is slow), and more and more CPUs will move to that layout even on single chips, so a share-nothing approach can really make Python a good contender on modern hardware. On Sun, Jul 15, 2018 at 6:20 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- Leonardo Santagada

(Apologies for the slow reply, I'm in the middle of a relocation at the moment so e-mail access isn't consistent, and will be a lot worse over the next few weeks.) On Tue, Jul 10, 2018 at 07:31:49AM -0700, David Foster wrote:
PyParallel was... ambitious, to say the least. When I started it, I sort of *hand wavy* envisioned it would lead to something that I could formally pitch to python-dev@. There was a lot of blissful ignorance of the ensuing complexity in that initial sentiment, though. So, nothing was formally turned down by core developers, as I never really ended up pitching something formal that could be assessed for inclusion. By the time I'd developed something that was at least an alpha-level proof-of-concept, I had to make 50+ pretty sizable implementation decisions that would have warranted their own PEP if the work ever made it into the mainline Python.

I definitely think a PyParallel-esque approach (where we play it fast and loose with what's considered the GIL, how and when reference counting is done, etc.) is the only viable *performant* option we have for solving the problem -- i.e. I can't see how a "remove the GIL, introduce fine-grained locking, use interlocked ops for ref counts"-type conventional approach will ever yield acceptable performance. But, yeah, I'm not optimistic we'll see a solution actually in the mainline Python any time soon.

I logged about 2500 hours of development time hacking PyParallel into its initial alpha proof-of-concept state. It only worked on one operating system, required intimate knowledge of Python innards (which I lacked at the start), and exposed a very brittle socket-server-oriented interface to leverage the parallelism (there was no parallel compute/free-threading-type support provided, really). I can't think of how we'll arrive at something production quality without it being a multi-year, many-developer (full time, ideally located in proximity to each other) project. I think you'd really need a BDFL Guido/Linus/Cutler-type lead driving the whole effort too, as there will be a lot of tough, divisive decisions that need to be made.

How would that be funded?! It's almost a bit of a moon-shot type project. Definitely high-risk. There's no precedent for the PSF funding such projects, nor for large corporate entities (e.g. Google, Amazon, Microsoft) doing so. What's the ROI for those companies to take on so much cost and risk? Perhaps if the end solution only ran on their cloud infrastructure (Azure, AWS, GCS) -- maybe at least initially. That... that would be an interesting turn of events.

Maybe we just wait 20 years 'til a NumPy/SciPy/Z3 stack does some cloud AI stuff to "solve" which parts of an existing program can be executed in parallel without any user/developer assistance :-)
David Foster | Seattle, WA, USA
Regards, Trent. -- https://trent.me

On Sun, Jul 8, 2018 at 12:30 PM David Foster <davidfstr@gmail.com> wrote:
Yep, Python's multi-core story is a bit rough (Jython/IronPython aside). It's especially hard for folks used to concurrency/parallelism in other languages. I'm hopeful that we can improve the situation.
I was thinking it would be nice if there was a better way to implement the Actor model, with multiple lines of execution in the same process,
FWIW, at this point I'm a big fan of this concurrency model. I find it hurts my brain least. :)
I'm glad you found PEP 554. I wanted to keep the PEP focused on exposing the existing subinterpreter support (and the basic, CSP-inspired concurrency model), which is why it doesn't go into much detail about changes to the CPython runtime that will allow GIL-free multi-core parallelism. As Nick mentioned, my talk at the language summit covers my plans.

Improving Python's multi-core story has been the major focus of my (sadly relatively small) contributions to CPython for several years now. I've made slow progress due to limited time, but things are picking up, especially since I got a job in December at Microsoft that allows me to work on CPython for part of each week. On top of that, several other people are directly helping now (including Emily Morehouse) and I got a lot of positive feedback for the project at PyCon this year.
Right, this is the approach I'm driving. At this point I have the project broken down pretty well into manageable chunks. You're welcome to join in. :) Regardless, I'd be glad to discuss it with you in more depth if you're interested.
It may be worth a shot. You should ask Davin Potts (CC'ed) about this. We discussed this a little at PyCon. I'm sure he'd welcome help in improving the multiprocessing module. -eric

On 7/14/2018 5:40 AM, Antoine Pitrou wrote:
It's good to know that there is an active core dev who can be added as nosy on multiprocessing issues. The multiprocessing line in the Experts Index, https://devguide.python.org/experts/, has Davin starred (assign issues to him) and you not (nosy only). Perhaps Davin should be un-starred.

Some time ago, on the pydev list, you suggested that the solution to IDLE's problems with subprocess and sockets might be to use multiprocessing and pipes. I noticed then that there were numerous bug reports and little activity, and wondered how usable multiprocessing was in practice. Checking again, there are 52 open behavior and 6 open crash issues with 'multiprocessing' in the title. The most severe one for IDLE that I noticed is #33111: importing tkinter and running multiprocessing on MacOS does not seem to work.

This week I (re)read the main multiprocessing doc chapter. The main issue I saw is 'Beware of replacing sys.stdin with a "file like object"'. I don't *think* that this is a showstopper, but a minimal failing example would help to be sure. -- Terry Jan Reedy

On Sun, Jul 8, 2018 at 11:27 AM, David Foster <davidfstr@gmail.com> wrote:
What do you mean by "the Actor model"? Just shared-nothing concurrency? (My understanding is that in academia it means shared-nothing + every thread/process/whatever gets an associated queue + queues are globally addressable + queues have unbounded buffering + every thread/process/whatever is implemented as a loop that reads messages from its queue and responds to them, with no internal concurrency. I don't know why this particular bundle of features is considered special. Lots of people seem to use it in a looser sense, though.)
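[Editorial sketch] That particular bundle can be written down in a few lines of plain Python; threads and queue.Queue stand in here for whatever the unit of execution and mailbox end up being. This is a rough illustration, not anyone's proposed API.

    import queue
    import threading

    mailboxes = {}  # globally addressable: anyone can look up any queue

    def spawn(name, handler):
        mailboxes[name] = queue.Queue()  # unbounded buffering
        def loop():
            while True:  # no internal concurrency: one message at a time
                msg = mailboxes[name].get()
                if msg is None:
                    return
                handler(name, msg)
        t = threading.Thread(target=loop)
        t.start()
        return t

    def send(name, msg):
        mailboxes[name].put(msg)

    t = spawn("echo", lambda me, msg: print(me, "got", msg))
    send("echo", "hi")
    send("echo", None)  # shutdown sentinel
    t.join()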
I guess I would distinguish, though, between "multiple processes" and "the multiprocessing module". The module might be at the point in its lifecycle where starting over is at least worth considering, and one thing I'm hoping to do with Trio is experiment with making worker-process patterns easier to work with.

But the nice thing about these two options is that subinterpreters are basically a way to emulate multiple Python processes within a single OS process, which means they're largely interchangeable. There are trade-offs in terms of compatibility, how much work needs to be done, and probably speed, but if you come up with a great API based around one model then you should be able to switch out the backend later without affecting users. So if you want to start experimenting now, I'd use multiple processes and plan to switch to subinterpreters later if it turns out to make sense. -n -- Nathaniel J. Smith -- https://vorpus.org

On Mon, Jul 16, 2018 at 10:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
Shared-nothing concurrency is, of course, the very easiest way to parallelize. But let's suppose you're trying to create an online multiplayer game. Since it's a popular genre at the moment, I'll go for a battle royale game (think PUBG, H1Z1, Fortnite, etc). A hundred people enter; one leaves. The game has to let those hundred people interact, which means that all hundred people have to be connected to the same server. And you have to process everyone's movements, gunshots, projectiles, etc, etc, etc, fast enough to be able to run a server "tick" enough times per second - I would say 32 ticks per second is an absolute minimum, 64 is definitely better. So what happens when the processing required takes more than one CPU core for 1/32 of a second?

A shared-nothing model is either fundamentally impossible, or a meaningless abstraction (if you interpret it to mean "explicit queues/pipes for everything"). What would the "Actor" model do here?

Ideally, I would like to be able to write my code as a set of functions, then easily spin them off as separate threads, and have them able to magically run across separate CPUs. Unicorns not being a thing, I'm okay with warping my code a bit around the need for parallelism, but I'm not sure how best to do that. Assume here that we can't cheat by getting most of the processing work done with the GIL released (eg in Numpy), and it actually does require Python-level parallelism of CPU-heavy work. ChrisA

On Sun, Jul 15, 2018 at 6:00 PM, Chris Angelico <rosuav@gmail.com> wrote:
"Shared-nothing" is a bit of jargon that means there's no *implicit* sharing; your threads can still communicate, the communication just has to be explicit. I don't know exactly what algorithms your hypothetical game needs, but they might be totally fine in a shared-nothing approach. It's not just for embarrassingly parallel problems.
If you need shared-memory threads, on multiple cores, for CPU-bound logic, where the logic is implemented in Python, then yeah, you basically need a free-threaded implementation of Python. Jython is such an implementation. PyPy could be if anyone were interested in funding it [1], but apparently no-one is. Probably removing the GIL from CPython is impossible. (I'd be happy to be proven wrong.) Sorry I don't have anything better to report. The good news is that there are many, many situations where you don't actually need "shared-memory threads, on multiple cores, for CPU-bound logic, where the logic is implemented in Python". If you're in that specific niche and don't have $100k to throw at PyPy, then I dunno, I hear Rust is good at that sort of thing? It's frustrating for sure, but there will always be niches where Python isn't the best choice. -n [1] https://morepypy.blogspot.com/2017/08/lets-remove-global-interpreter-lock.ht... -- Nathaniel J. Smith -- https://vorpus.org

On Mon, Jul 16, 2018 at 1:21 PM, Nathaniel Smith <njs@pobox.com> wrote:
Right, so basically it's the exact model that Python *already* has for multiprocessing - once you go to separate processes, nothing is implicitly shared, and everything has to be done with queues.
(This was a purely hypothetical example.) There could be some interesting results from using the GIL only for truly global objects, and then having other objects guarded by arena locks. The trouble is that, in CPython, as soon as you reference any read-only object from the globals, you need to raise its refcount. ISTR someone mentioned something along the lines of sys.eternalize(obj) to flag something as "never GC this thing, it no longer has a refcount", which would then allow global objects to be referenced in a truly read-only way (eg to call a function). Sadly, I'm not expert enough to actually look into implementing it, but it does seem like a very cool concept. It also fits into the "warping my code a bit" category (eg eternalizing a small handful of key objects, and paying the price of "well, now they can never be garbage collected"), with the potential to then parallelize more easily.
Oh absolutely. MOST of my parallelism requirements involve regular Python threads, because they spend most of their time blocked on something. That one is easy. The hassle comes when something MIGHT need parallelism and might not, based on (say) how much data it has to work with; for those kinds of programs, I would like to be able to code it the simple way with minimal code overhead, but still able to split over cores. And yes, I'm aware that it's never going to be perfect, but the closer the better. ChrisA
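[Editorial sketch] A pure-Python stand-in for the hypothetical sys.eternalize() mentioned above can only approximate the idea by holding a permanent strong reference; actually eliminating the refcount traffic, which is the part that would help parallelism, would need interpreter support.

    _immortal = []  # permanent strong references: these objects never die

    def eternalize(obj):
        # Stand-in for the hypothetical sys.eternalize(): the object can
        # never be garbage-collected.  A real implementation would also
        # flag the object so Py_INCREF/Py_DECREF become no-ops on it.
        _immortal.append(obj)
        return obj

    SHARED_TABLE = eternalize({"answer": 42})  # e.g. a key config object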

What about the following model: you have N Python interpreters, each with their own GIL. Each *Python* object belongs to precisely one interpreter. However, the interpreters share some common data storage: perhaps a shared Numpy array, or a shared SQLite in-memory db, or some key-value store where the keys and values are binary data. The interpreters communicate through that. Stephan On Mon, 16 Jul 2018 at 06:25, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Jul 16, 2018 at 3:00 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Interesting. The actual concrete idea that I had in mind was an image comparison job, downloading umpteen billion separate images and trying to find the one most similar to a template. Due to lack of easy parallelism I was unable to then compare the top thousand against each other quadratically, but it would be interesting to see if I could have done something like that to share image comparison information. ChrisA

On Mon, 16 Jul 2018 07:00:34 +0200 Stephan Houben <stephanh42@gmail.com> wrote:
What about the following model: you have N Python interpreters, each with their own GIL. Each *Python* object belongs to precisely one interpreter.
This is roughly what Eric's subinterpreters approach tries to do. Regards Antoine.

On 2018-07-16 05:24, Chris Angelico wrote:
Could you explicitly share an object in a similar way to how you explicitly open a file? The shared object's refcount would be incremented and the sharing function would return a proxy to the shared object. Refcounting in the thread/process would be done on the proxy. When the proxy is closed or garbage-collected, the shared object's refcount would be decremented. The shared object could be garbage-collected when its refcount drops to zero.

On Mon, 16 Jul 2018 18:00:37 +0100 MRAB <python@mrabarnett.plus.com> wrote:
Yes, I'm assuming that would be how shareable buffers could be implemented: a per-interpreter proxy (with a regular Python refcount) mediating access to a shared object (which could have an atomic / thread-safe refcount). As for how shareable buffers could be useful, see my work on PEP 574: https://www.python.org/dev/peps/pep-0574/ Regards Antoine.

On Mon, Jul 16, 2018 at 11:08 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
Nice! That's exactly how I'm doing it. :) The buffer protocol makes it easier, but the idea could apply to arbitrary objects generally. That's something I'll look into in a later phase of the project. In both cases the tricky part is ensuring that the proxy does not directly mutate the object (especially the refcount). In fact, the decref part above is the trickiest.

The trickiness is a consequence of our goals. In my multi-core project we're aiming for not sharing the GIL between interpreters. That means reaching and keeping proper separation between interpreters. Notably, without a GIL shared by interpreters, refcount operations are not thread-safe. Also, in the decref case GC would happen under the wrong interpreter (which is problematic for several reasons). With this in mind, here's how I'm approaching the problem:

1. interp A "shares" an object with interp B (e.g. through a channel)
   * the object is incref'ed under A before it is sent to B
2. the object is wrapped in a proxy owned by B
   * the proxy may not make C-API calls that would mutate the object or even cause an incref/decref
3. when the proxy is GC'd, the original object is decref'ed
   * the decref must happen in a thread in which A is running

In order to make all this work, the missing piece is a mechanism by which the decref (#3) happens under the original interpreter. At the moment Emily Morehouse and I are pursuing an approach that extends the existing ceval "pending call" machinery currently used for handling signals (see Py_AddPendingCall). The new [*private*] API would work the same way but on a per-interpreter basis rather than just the main interpreter. This would allow one interpreter to queue up a decref to happen later under another interpreter.

FWIW, this ability to decref an object under a different interpreter is a blocker right now for a number of things, including supporting buffers in PEP 554 channels. -eric
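[Editorial sketch] Here is a pure-Python analogy of those three steps, with the per-interpreter "pending calls" machinery modeled as a plain queue. All names are illustrative; the real mechanism lives in C.

    import queue

    class Owned:
        """Stands in for a refcounted object owned by interpreter A."""
        def __init__(self, value):
            self.value = value
            self.refcount = 1  # only interpreter A may touch this

    pending_decrefs = queue.Queue()  # models A's pending-call queue

    class Proxy:
        """Step 2: interpreter B holds a proxy, never the object itself."""
        def __init__(self, obj):
            self._obj = obj
        def read(self):
            return self._obj.value  # no mutation, no incref/decref
        def close(self):
            # Step 3: B may not decref directly; it queues the decref so
            # it runs later in a thread where A is active.
            pending_decrefs.put(self._obj)

    def share(obj):
        """Step 1: incref under A before handing the object to B."""
        obj.refcount += 1
        return Proxy(obj)

    def drain_pending_calls():
        """Runs under interpreter A, like Py_AddPendingCall callbacks."""
        while not pending_decrefs.empty():
            obj = pending_decrefs.get()
            obj.refcount -= 1
            if obj.refcount == 0:
                pass  # A deallocates here (and __del__ runs safely under A)

    o = Owned("payload")
    p = share(o)             # A -> B
    print(p.read())
    p.close()                # B is done with it
    drain_pending_calls()    # later, under A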

On Tue, Jul 17, 2018 at 1:44 PM Barry <barry@barrys-emacs.org> wrote:
The decrement itself is not the problem, that can be made thread safe.
Yeah, by using the GIL. <wink> Otherwise, please elaborate. My understanding is that if the decrement itself were not the problem then we'd have gotten rid of the GIL already.
Do you mean that once the ref reaches 0 you have to make the delete happen on the original interpreter?
Yep. For one thing, GC can trigger __del__, which can do anything, including modifying other objects from the original interpreter (incl. decref'ing them). __del__ should be run under the original interpreter. For another thing, during GC containers often decref their items. Also, separating the GIL between interpreters may mean we'll need an allocator per interpreter. In that case the deallocation must happen relative to the interpreter where the object was allocated. -eric

All processors have thread-safe ways to inc, dec, and test integers without holding a lock; that is the mechanism that locks themselves are built out of. You can use that to avoid holding the GIL until the ref count reaches 0. In C++ it is built into the language with std::atomic_int; you would have to find the equivalent way to do this in C. I don't have an answer at my fingertips for C. Barry

On 18 Jul 2018 at 08:02, Barry <barry@barrys-emacs.org> wrote:
Some past attempts at getting rid of the GIL used atomic inc/dec, and that resulted in bad performance because these instructions aren’t cheap. My gut feeling is that you’d have to get rid of refcounts to get high performance when getting rid of the GIL in a single interpreter, which would almost certainly result in breaking the C API. Ronald

Isn't this class of problem what leads to the per-processor caches and other optimisations in the Linux kernel? I wonder if those kernel optimisations could be applied to this problem?
My gut feeling is that you’d have to get rid of refcounts to get high performance when getting rid of the GIL in a single interpreter, which would almost certainly result in breaking the C API.
Working on the ref-count costs might be the enabling tech. We already have the problem of unchanging objects being copied after a fork because the ref counts live inside the object; it was suggested that the ref count would have to move out of the object to help with this problem. If there is a desirable solution to the parallel problem, we can think about the C API migration problem. Barry

On Wed, 18 Jul 2018 08:21:31 +0100 Ronald Oussoren via Python-ideas <python-ideas@python.org> wrote:
Some past attempts at getting rid of the GIL used atomic inc/dec, and that resulted in bad performance because these instructions aren’t cheap.
Please read in context: we are not talking about making all refcounts atomic, only a couple refcounts on shared objects (which probably won't be Python objects, actually). Regards Antoine.

Let me try a longer answer. The inc+test and dec+test do not require a lock if coded correctly. All OSes and runtimes have solved this in order to provide locks; all processors provide the instructions that are the building blocks for lock primitives. You cannot mutate a mutable Python object that is not protected with the GIL, as the change of state involves multiple parts of the object changing. But if you know that an object is immutable, then the only state you will ever change is its ref count. To access the object you only have to ensure it will not be deleted, which the ref count guarantees. The delete of the immutable object is then the only job that the original interpreter must do.
Yep that I understand. Barry
-eric

On Wed, Jul 18, 2018 at 1:37 AM Barry Scott <barry@barrys-emacs.org> wrote:
Perhaps we're agreeing? Other than the single decref when "releasing" the object, it won't ever be directly modified (even the refcount) in the other interpreter. In effect that interpreter holds a reference to the object which prevents GC in the "owning" interpreter (the corresponding incref happened in that original interpreter before the object was "shared"). The only issue is how to "release" the object in the other interpreter so that the decref happens in the "owning" interpreter. As noted earlier, I'm planning on taking advantage of the existing ceval "pending calls" machinery.

So I'm not sure where an atomic int would factor in. If you mean switching the existing refcount to an atomic int for the sake of the cross-interpreter decref, then that's not going to happen, as Ronald suggested. Larry could tell you about his Gilectomy experience. :)

Are you suggesting something like a second "cross-interpreter refcount", which would be atomic, with a check added in Py_DECREF? That would imply an extra cross-interpreter-oriented C-API to parallel Py_DECREF. It would also mean either adding another field to PyObject (yikes!) or keeping a separate table for tracking cross-interpreter references. I'm not sure any of that would be better than the alternative I'm pursuing. Then again, I've considered tracking which interpreters hold a "reference" to an object, which isn't that different. -eric

Hi Eric, Antoine, all. Antoine said that what I proposed earlier was very similar to what Eric is trying to do, but from the direction the discussion has taken so far that appears not to be the case. I will therefore try to clarify my proposal.

Basically, what I am suggesting is a direct translation of Javascript's Web Worker API (https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API) to Python. The Web Worker API is generally considered a "share-nothing" approach, although as we will see some state can be shared.

The basic principle is that any object lives in a single Worker (Worker = subinterpreter). If a message is sent from Worker A to Worker B, the message is not shared; rather, the so-called "structured clone" algorithm is used to recursively create a NEW message object in Worker B. This is roughly equivalent to pickling in A and then unpickling in B.

Of course, this may become a bottleneck if large amounts of data need to be communicated. Therefore, there is a special object type designed to provide a view upon a piece of shared memory: SharedArrayBuffer. Notably, this only provides a view upon raw "C"-style data (ints or floats or whatever), not on Javascript objects.

To translate this to the Python situation: each Python object is owned by a single subinterpreter, and may only be manipulated by a thread which holds the GIL of that particular subinterpreter. Message sending between subinterpreters will require the message objects to be "structured cloned".

Certain C extension types may override what structured cloning means for them. In particular, some C extension types may have a two-layer structure where the PyObject contains a refcounted pointer to the actual data. The structured cloning of such an object may create a second PyObject which references the same underlying object. This secondary refcount will need to be properly atomic, since it may be manipulated from multiple subinterpreters. In this way, interpreter-shared data structures can be implemented. However, all the "normal" Python objects are not shared and can continue to use the current, non-atomic refcounting implementation.

Hope this clarifies my proposal. Stephan 2018-07-18 19:58 GMT+02:00 Eric Snow <ericsnowcurrently@gmail.com>:
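[Editorial sketch] In Python terms, the "structured clone" step described above is roughly a pickle round-trip, glossing over the efficiency concern Stephan notes:

    import pickle

    def structured_clone(message):
        # The receiver gets a brand-new, fully independent object graph,
        # never a reference into the sender's heap.
        return pickle.loads(pickle.dumps(message))

    original = {"positions": [(1, 2), (3, 4)], "tick": 42}
    clone = structured_clone(original)
    assert clone == original and clone is not original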

Hi Python in the age of the multi-core processor is an important question. And garbage collection is one of the many issues involved. I've been thinking about the garbage collection problem, and lurking on this list, for a while. I think it's now about time I showed myself, and shared my thoughts. I intend to do this in a new thread, dealing only with the problem of multi-core reference counting garbage collection. I hope you don't mind my doing this. Expect the first instalment tomorrow. with best regards Jonathan

On Wed, Jul 18, 2018 at 12:49 PM Stephan Houben <stephanh42@gmail.com> wrote:
It looks like we are after the same thing actually. :) Sorry for any confusion. There are currently no provisions for actually sharing objects between interpreters. In fact, initially the plan is basically to support sharing copies of basic builtin immutable types. The question of refcounts comes in when we actually do share the underlying data of immutable objects (e.g. via the buffer protocol).
Yes, there's a strong parallel to that model here. In fact, I mentioned web workers in my language summit talk at PyCon 2018.
That is exactly what the channels in the PEP 554 implementation do, though much more efficiently than pickling. Initial support will be for basic builtin immutable types. We can later consider support for other (even arbitrary?) types, but anything beyond copying (e.g. pickle) is way off my radar. Python's C-API is so closely tied to refcounting that we simply cannot support safely sharing actual Python objects between interpreters once we no longer share the GIL between them.
Yep, that translates to buffers in Python, which is covered by PEP 554 (see SendChannel.send_buffer). In this case, where some underlying data is actually shared, the implementation has to deal with keeping a reference to the original object and releasing it when done, which is what all the talk of refcounts has been about. However, the PEP does not talk about it because it is an implementation detail that is not exposed in Python.
Correct. That is what PEP 554 does. As an aside, your phrasing "may only be manipulated by a thread which holds the GIL of that particular subinterpreter" did spark something I'll consider later: perhaps interpreters can acquire each other's GIL when (infrequently) necessary. That could simplify a few things.
My implementation of PEP 554 supports this, though I have not made the C-API for it public. It's also not part of the PEP. I was considering adding it.
That is correct. That entirely matches what I'm doing with PEP 554. In fact, the isolation between interpreters is critical to my multi-core Python project, of which PEP 554 is a part. It's necessary in order to stop sharing the GIL between interpreters. So actual objects will never be shared between interpreters. They can't be.
Hope this clarifies my proposal.
Yep. Thanks! -eric

On 2018-07-18 20:35, Eric Snow wrote:
What if an object is not going to be shared, but instead "moved" from one subinterpreter to another? The first subinterpreter would no longer have a reference to the object. If the object's refcount is 1 and the object doesn't refer to any other object, then copying would not be necessary. [snip]

On Wed, Jul 18, 2018 at 2:38 PM MRAB <python@mrabarnett.plus.com> wrote:
Yeah, that's something that I'm sure we'll investigate at some point, but it's not part of the short-term plans. This belongs to a whole class of possibilities that we'll explore once we have the basic functionality established. :) FWIW, I don't think that "moving" an object like this would be too hard to implement. -eric

On Wed, Jul 18, 2018 at 11:49 AM, Stephan Houben <stephanh42@gmail.com> wrote:
Note that everything you said here also exactly describes the programming model for the existing 'multiprocessing' module: "structured clone" is equivalent to how multiprocessing uses pickle to transfer arbitrary objects, or you can use multiprocessing.Array to get a shared view on raw "C"-style data. -n -- Nathaniel J. Smith -- https://vorpus.org

Hi Nathaniel, 2018-07-19 1:33 GMT+02:00 Nathaniel Smith <njs@pobox.com>:
This is true. In fact, I am a big fan of multiprocessing and I think it is often overlooked/underrated. Experience with multiprocessing is also what has me convinced that a share-nothing or share-explicit approach to concurrency is a useful programming model.

The main limitation of multiprocessing comes when you need to go outside Python and interact with C/C++ libraries or operating system services from multiple processes. The support for this generally varies from "extremely weak" to "none at all". For example, things I would like to do in parallel with a main thread/process:

* Upload data to the GPU using OpenGL or OpenCL
* Generate a picture in a PyQt QImage, then hand it over zero-copy to the main thread
* Interact with a complex scenegraph in C++ (shared with the main thread)

This is impossible right now but would be possible if the interpreters were all in-process.

In addition, there are things which are now hard with "multiprocessing" but could be fixed. For example, sharing a Numpy array is possible but very inconvenient: you need to first allocate the raw data segment, communicate that, then create in each process an array which uses this data segment (a sketch of this dance follows below). Ideally, this would rather work like this:

ar = numpy.zeros((30, 30), shared=True)

and then "ar" would automatically be shared. This is fixable, but given the other limitations above the question is whether it is worthwhile to fix it now. It would be a lot simpler to fix if we had the in-process model.

But yeah, I am actually also very open to ideas on how multiprocessing could be made more convenient and powerful. Perhaps there are ways, and I am just not seeing them. Stephan
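[Editorial sketch] For concreteness, the inconvenient-but-workable dance described above looks something like this today, assuming a float64 array; locking and error handling are omitted.

    import multiprocessing as mp
    import numpy as np

    def worker(raw, shape):
        # Re-wrap the shared raw segment as an ndarray in the child.
        ar = np.frombuffer(raw.get_obj()).reshape(shape)
        ar[0, 0] = 1.0  # visible to the parent: the memory is shared

    if __name__ == "__main__":
        shape = (30, 30)
        raw = mp.Array("d", shape[0] * shape[1])  # raw segment first...
        ar = np.frombuffer(raw.get_obj()).reshape(shape)  # ...then wrap it
        p = mp.Process(target=worker, args=(raw, shape))
        p.start()
        p.join()
        print(ar[0, 0])  # prints 1.0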

On 18 July 2018 at 05:35, Eric Snow <ericsnowcurrently@gmail.com> wrote:
Aw, I guess the original idea of just doing an active interpreter context switch in the current thread around the shared object decref operation didn't work out? That's a shame. I'd be curious as to the technical details of what actually failed in that approach, as I would have expected it to at least work, even if the performance might not have been wonderful. (Although thinking about it further now given a per-interpreter locking model, I suspect there could be some wonderful opportunities for cross-interpreter deadlocks that we didn't consider in our initial design sketch...) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Eric,
How does the proxy at the same time make the object accessible and prevent mutation? Would it help if there was an explicit owning thread for each object? I'm thinking that you could do a fast check that the object belongs to the current thread, and use that knowledge to avoid locking. If the object is owned by another thread, acquire the GIL in the traditional way so that mutating the state will be safe. The "sharing" process can ensure that, until an explicit "unsharing", the object remains safe to access in all threads that share it, avoiding the need for the special processor instructions. Barry

MRAB wrote:
What about other objects accessed through the shared object? They would need to get wrapped in proxies too. Also, if the shared object is mutable, changes to it would need to be protected by a lock of some kind. Maybe all this could be taken care of by the proxy objects, but it seems like it would be quite tricky to get right. -- Greg

On Sun, 15 Jul 2018 20:21:56 -0700 Nathaniel Smith <njs@pobox.com> wrote:
It's not that it's impossible; it's that everyone who has tried to remove it ended up with a 30-40% slowdown in single-threaded mode (*). Perhaps Larry will manage to do better, though ;-) (*) a figure which I assume is highly workload-dependent Regards Antoine.

On 9 July 2018 at 04:27, David Foster <davidfstr@gmail.com> wrote:
Yep, that's basically the way Eric and I and a few others have been thinking. Eric started off this year's language summit with a presentation on the topic: https://lwn.net/Articles/754162/ The intent behind PEP 554 is to eventually get to a point where each subinterpreter has its own dedicated eval loop lock, and the GIL either disappears entirely (replaced by smaller purpose specific locks) or becomes a read/write lock (where write access is only needed to adjust certain state that is shared across subinterpreters). On the multiprocessing front, it could be quite interesting to attempt to adapt the channel API from PEP 554 to the https://docs.python.org/3/library/multiprocessing.html#module-multiprocessin... data sharing capabilities in the modern multiprocessing module. Also of relevance is Antoine Pitrou's work on a new version of the pickle protocol that allows for out-of-band data sharing to avoid redundant memory copies: https://www.python.org/dev/peps/pep-0574/ Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jul 08, 2018 at 11:27:08AM -0700, David Foster wrote:
You might find PyParallel interesting, at least from a "here's what was tried, it worked, but we're not doing it like that" perspective. http://pyparallel.org https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite... I still think it was a pretty successful proof-of-concept regarding removing the GIL without having to actually remove it. Performance was pretty good too, as you can see in those graphs.
-- David Foster | Seattle, WA, USA
Regards, Trent. -- https://trent.me

I was not aware of PyParallel. The PyParellel "parallel thread" line-of-execution implementation is pretty interesting. Trent, big kudos to you on that effort. Since you're speaking in the past tense and said "but we're not doing it like that", I infer that the notion of a parallel thread was turned down for integration into CPython, as that appears to have been the original goal. However I am unable to locate a rationale for why that integration was turned down. Was it deemed to be too complex to execute, perhaps in the context of providing C extension compatibility? Was there a desire to see a similar implementation on Linux as well as Windows? Some other reason? Since I presume you were directly involved in the discussions, perhaps you have a link to the relevant thread handy? The last update I see from you RE PyParallel on this list is: https://mail.python.org/pipermail/python-ideas/2015-September/035725.html David Foster | Seattle, WA, USA On 7/9/18 9:17 AM, Trent Nelson wrote:

On 7/10/2018 10:31 AM, David Foster wrote:
A far as I remember, there was never a formal proposal (PEP). And I just searched PEP 0 for 'parallel'. Hence, no formal rejection, rationale, or thread.
As always, there may have been private, off-the-record, informal discussions. -- Terry Jan Reedy

On Tue, Jul 10, 2018 at 8:32 AM David Foster <davidfstr@gmail.com> wrote:
+1 It's a neat project. Trent's pretty smart. :)
Trent can correct me if I'm wrong, but I believe it boiled down to challenges with the POSIX implementation (that email thread implies this as well), likely coupled with limited time for Trent to work on it. -eric

On 11 July 2018 at 00:31, David Foster <davidfstr@gmail.com> wrote:
It was never extended beyond Windows, and a Windows-only solution doesn't meet the needs of a lot of folks interested in more efficient exploitation of multiple local CPU cores. It's still an interesting design concept though, especially for problems that can be deconstructed into a setup phase (read/write main thread), and a parallel operation phase (ephemeral worker threads that store all persistent state in memory mapped files, or otherwise outside the current process). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

No one talked about this, but modern cpus with multiple numa nodes are atrocious to any shared memory (maybe threadripper is better, but multiple socket xeon is slow) and more and more all cpus will move to it on single chips, so a share nothing aproach can really make python a good contender on modern hardware. On Sun, Jul 15, 2018 at 6:20 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- Leonardo Santagada

(Apologies for the slow reply, I'm in the middle of a relocation at the moment so e-mail access isn't consistent, and will be a lot worse over the next few weeks.) On Tue, Jul 10, 2018 at 07:31:49AM -0700, David Foster wrote:
PyParallel was... ambitious to say the least. When I started it, I sort of *hand wavy* envisioned it would lead to something that I could formally pitch to python-dev@. But there was a lot of blissful ignorance of the ensuing complexity in that initial sentiment, though. So, nothing was formally turned down by core developers, as I never really ended up pitching something formal that could be assessed for inclusion. By the time I'd developed something that was at least an alpha-level proof-of-concept, I had to make 50+ pretty sizable implementation decisions that would have warranted their own PEP if the work ever made it into the mainline Python. I definitely think a PyParallel-esque approach (where we play it fast and loose with what's considered the GIL, how and when reference counting is done, etc.) is the only viable *performant* option we have for solving the problem -- i.e. I can't see how a "remove the GIL, introduce fine grained locking, use interlocked ops for ref counts"-type conventional approach will ever yield acceptable performance. But, yeah, I'm not optimistic we'll see a solution actually in the mainline Python any time soon. I logged about 2500 hours of development time hacking PyParallel into it's initial alpha proof-of-concept state. It only worked on one operating system, required intimate knowledge of Python innards (which I lacked at the start), and exposed a very brittle socket-server oriented interface to leverage the parallelism (there was no parallel compute/free-threading type support provided, really). I can't think of how we'll arrive at something production quality without it being a multi-year, many-developer (full time, ideally located in proximity to each other) project. I think you'd really need a BDFL Guido/Linus/Cutler-type lead driving the whole effort too, as there will be a lot of tough, dividing decisions that need to be made. How would that be funded?! It's almost a bit of a moon-shot type project. Definitely high-risk. There's no precedent for the PSF funding such projects, nor large corporate entities (i.e. Google, Amazon, Microsoft). What's the ROI for those companies to take on so much cost and risk? Perhaps if the end solution only ran on their cloud infrastructure (Azure, AWS, GCS) -- maybe at least initially. That... that would be an interesting turn of events. Maybe we just wait 20 years 'til a NumPy/SciPy/Z3-stack does some cloud AI stuff to "solve" which parts of an existing program can be executed in parallel without any user/developer assistance :-)
David Foster | Seattle, WA, USA
Regards, Trent. -- https://trent.me

On Sun, Jul 8, 2018 at 12:30 PM David Foster <davidfstr@gmail.com> wrote:
Yep, Python's multi-core story is a bit rough (Jython/IronPython aside). It's especially hard for folks used to concurrency/parallelism in other languages. I'm hopeful that we can improve the situation.
I was thinking it would be nice if there was a better way to implement the Actor model, with multiple lines of execution in the same process,
FWIW, at this point I'm a big fan of this concurrency model. I find it hurts my brain least. :)
I'm glad you found PEP 554. I wanted to keep the PEP focused on exposing the existing subinterpreter support (and the basic, CSP-inspired concurrency model), which is why it doesn't go into much detail about changes to the CPython runtime that will allow GIL-free multi-core parallelism. As Nick mentioned, my talk at the language summit covers my plans. Improving Python's multi-core story has been the major focus of my (sadly relatively small) contributions to CPython for several years now. I've made slow progress due to limited time, but things are picking up, especially since I got a job in December at Microsoft that allows me to work on CPython for part of each week. On top of that, several other people are directly helping now (including Emily Morehouse) and I got a lot of positive feedback for the project at PyCon this year.
Right, this is the approach I'm driving. At this point I have the project broken down pretty well into manageable chunks. You're welcome to join in. :) Regardless, I'd be glad to discuss it with you in more depth if you're interested.
It may worth a shot. You should ask Davin Potts (CC'ed) about this. We discussed this a little at PyCon. I'm sure he'd welcome help in improving the multiprocessing module. -eric

On 7/14/2018 5:40 AM, Antoine Pitrou wrote:
It's good to know that there is an active coredev who can be added as nosy on multiprocessing issues. The multiprocessing line in the Expert's Index, https://devguide.python.org/experts/ has Davin *ed (assign issues to him) and you not (nosy only). Perhaps Davin should be un-starred. Some time ago, on pydev list, you suggested that the solution to IDLE's problems with subprocess and sockets might be to use multiprocessing and pipes. I noticed then that there were numerous bug report and little activity and wondered then how usable multiprocessing was in practice. Checking again, there are 52 open behavior and 6 open crash issues with 'multiprocessing' in the title. The must severe one for IDLE that I noticed is #33111: importing tkinter and running multiprocessing on MacOS does not seem to work. This week I (re)read the main multiprocessing doc chapter. The main issue I saw is 'Beware of replacing sys.stdin with a “file like object”'. I don't *think* that this is a showstopper, but a minimal failing example would help to be sure. -- Terry Jan Reedy

On Sun, Jul 8, 2018 at 11:27 AM, David Foster <davidfstr@gmail.com> wrote:
What do you mean by "the Actor model"? Just shared-nothing concurrency? (My understanding is that in academia it means shared-nothing + every thread/process/whatever gets an associated queue + queues are globally addressable + queues have unbounded buffering + every thread/process/whatever is implemented as a loop that reads messages from its queue and responds to them, with no internal concurrency. I don't know why this particular bundle of features is considered special. Lots of people seem to use it in looser sense though.)
I guess I would distinguish though between "multiple processes" and "the multiprocessing module". The module might be at the point in its lifecycle where starting over is at least worth considering, and one thing I'm hoping to do with Trio is experiment with making worker process patterns easier to work with. But the nice thing about these two options is that subinterpreters are basically a way to emulate multiple Python processes within a single OS process, which means they're largely interchangeable. There are trade-offs in terms of compatibility, how much work needs to be done, probably speed, but if you come up with a great API based around one model then you should be able to switch out the backend later without affecting users. So if you want to start experimenting now, I'd use multiple processes and plan to switch to subinterpreters later if it turns out to make sense. -n -- Nathaniel J. Smith -- https://vorpus.org

On Mon, Jul 16, 2018 at 10:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
Shared-nothing concurrency is, of course, the very easiest way to parallelize. But let's suppose you're trying to create an online multiplayer game. Since it's a popular genre at the moment, I'll go for a battle royale game (think PUBG, H1Z1, Fortnite, etc). A hundred people enter; one leaves. The game has to let those hundred people interact, which means that all hundred people have to be connected to the same server. And you have to process everyone's movements, gunshots, projectiles, etc, etc, etc, fast enough to be able to run a server "tick" enough times per second - I would say 32 ticks per second is an absolute minimum, 64 is definitely better. So what happens when the processing required takes more than one CPU core for 1/32 seconds? A shared-nothing model is either fundamentally impossible, or a meaningless abstraction (if you interpret it to mean "explicit queues/pipes for everything"). What would the "Actor" model do here? Ideally, I would like to be able to write my code as a set of functions, then easily spin them off as separate threads, and have them able to magically run across separate CPUs. Unicorns not being a thing, I'm okay with warping my code a bit around the need for parallelism, but I'm not sure how best to do that. Assume here that we can't cheat by getting most of the processing work done with the GIL released (eg in Numpy), and it actually does require Python-level parallelism of CPU-heavy work. ChrisA

On Sun, Jul 15, 2018 at 6:00 PM, Chris Angelico <rosuav@gmail.com> wrote:
"Shared-nothing" is a bit of jargon that means there's no *implicit* sharing; your threads can still communicate, the communication just has to be explicit. I don't know exactly what algorithms your hypothetical game needs, but they might be totally fine in a shared-nothing approach. It's not just for embarrassingly parallel problems.
If you need shared-memory threads, on multiple cores, for CPU-bound logic, where the logic is implemented in Python, then yeah, you basically need a free-threaded implementation of Python. Jython is such an implementation. PyPy could be if anyone were interested in funding it [1], but apparently no-one is. Probably removing the GIL from CPython is impossible. (I'd be happy to be proven wrong.) Sorry I don't have anything better to report. The good news is that there are many, many situations where you don't actually need "shared-memory threads, on multiple cores, for CPU-bound logic, where the logic is implemented in Python". If you're in that specific niche and don't have $100k to throw at PyPy, then I dunno, I hear Rust is good at that sort of thing? It's frustrating for sure, but there will always be niches where Python isn't the best choice. -n [1] https://morepypy.blogspot.com/2017/08/lets-remove-global-interpreter-lock.ht... -- Nathaniel J. Smith -- https://vorpus.org

On Mon, Jul 16, 2018 at 1:21 PM, Nathaniel Smith <njs@pobox.com> wrote:
Right, so basically it's the exact model that Python *already* has for multiprocessing - once you go to separate processes, nothing is implicitly shared, and everything has to be done with queues.
(This was a purely hypothetical example.) There could be some interesting results from using the GIL only for truly global objects, and then having other objects guarded by arena locks. The trouble is that, in CPython, as soon as you reference any read-only object from the globals, you need to raise its refcount. ISTR someone mentioned something along the lines of sys.eternalize(obj) to flag something as "never GC this thing, it no longer has a refcount", which would then allow global objects to be referenced in a truly read-only way (eg to call a function). Sadly, I'm not expert enough to actually look into implementing it, but it does seem like a very cool concept. It also fits into the "warping my code a bit" category (eg eternalizing a small handful of key objects, and paying the price of "well, now they can never be garbage collected"), with the potential to then parallelize more easily.
Oh absolutely. MOST of my parallelism requirements involve regular Python threads, because they spend most of their time blocked on something. That one is easy. The hassle comes when something MIGHT need parallelism and might not, based on (say) how much data it has to work with; for those kinds of programs, I would like to be able to code it the simple way with minimal code overhead, but still able to split over cores. And yes, I'm aware that it's never going to be perfect, but the closer the better. ChrisA

What about the following model: you have N Python interpreters, each with their own GIL. Each *Python* object belongs to precisely one interpreter. However, the interpreters share some common data storage: perhaps a shared Numpy array, or a shared Sqlite in-memory db. Or some key-value store where the key and values are binary data. The interpreters communicate through that. Stephan Op ma 16 jul. 2018 06:25 schreef Chris Angelico <rosuav@gmail.com>:

On Mon, Jul 16, 2018 at 3:00 PM, Stephan Houben <stephanh42@gmail.com> wrote:
Interesting. The actual concrete idea that I had in mind was an image comparison job, downloading umpteen billion separate images and trying to find the one most similar to a template. Due to lack of easy parallelism I was unable to then compare the top thousand against each other quadratically, but it would be interesting to see if I could have done something like that to share image comparison information. ChrisA

On Mon, 16 Jul 2018 07:00:34 +0200 Stephan Houben <stephanh42@gmail.com> wrote:
What about the following model: you have N Python interpreters, each with their own GIL. Each *Python* object belongs to precisely one interpreter.
This is roughly what Eric's subinterpreters approach tries to do. Regards Antoine.

On 2018-07-16 05:24, Chris Angelico wrote:
Could you explicitly share an object in a similar way to how you explicitly open a file? The shared object's refcount would be incremented and the sharing function would return a proxy to the shared object. Refcounting in the thread/process would be done on the proxy. When the proxy is closed or garbage-collected, the shared object's refcount would be decremented. The shared object could be garbage-collected when its refcount drops to zero.

On Mon, 16 Jul 2018 18:00:37 +0100 MRAB <python@mrabarnett.plus.com> wrote:
Yes, I'm assuming that would be how shareable buffers could be implemented: a per-interpreter proxy (with a regular Python refcount) mediating access to a shared object (which could have an atomic / thread-safe refcount). As for how shareable buffers could be useful, see my work on PEP 574: https://www.python.org/dev/peps/pep-0574/ Regards Antoine.

On Mon, Jul 16, 2018 at 11:08 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
Nice! That's exactly how I'm doing it. :) The buffer protocol makes it easier, but the idea could apply to arbitrary objects generally. That's something I'll look into in a later phase of the project. In both cases the tricky part is ensuring that the proxy does not directly mutate the object (especially the refcount). In fact, the decref part above is the trickiest. The trickiness is a consequence of our goals. In my multi-core project we're aiming for not sharing the GIL between interpreters. That means reaching and keeping proper separation between interpreters. Notably, without a GIL shared by interpreters, refcount operations are not thread-safe. Also, in the decref case GC would happen under the wrong interpreter (which is problematic for several reasons). With this in mind, here's how I'm approaching the problem: 1. interp A "shares" an object with interp B (e.g. through a channel) * the object is incref'ed under A before it is sent to B 2. the object is wrapped in a proxy owned by B * the proxy may not make C-API calls that would mutate the object or even cause an incref/decref 3. when the proxy is GC'd, the original object is decref'ed * the decref must happen in a thread in which A is running In order to make all this work the missing piece is a mechanism by which the decref (#3) happens under the original interpreter. At the moment Emily Morehouse and I are pursuing an approach that extends the existing ceval "pending call" machinery currently used for handling signals (see Py_AddPendingCall). The new [*private*] API would work the same way but on a per-interpreter basis rather than just the main interpreter. This would allow one interpreter to queue up a decref to happen later under another interpreter. FWIW, this ability to decref an object under a different interpreter is a blocker right now for a number of things, including supporting buffers in PEP 554 channels. -eric

On Tue, Jul 17, 2018 at 1:44 PM Barry <barry@barrys-emacs.org> wrote:
The decrement itself is not the problem, that can be made thread safe.
Yeah, by using the GIL. <wink> Otherwise, please elaborate. My understanding is that if the decrement itself were not the problem then we'd have gotten rid of the GIL already.
Do you mean that once the ref reaches 0 you have to make the delete happen on the original interpreter?
Yep. For one thing, GC can trigger __del__, which can do anything, including modifying other objects from the original interpreter (incl. decref'ing them). __del__ should be run under the original interpreter. For another thing, during GC containers often decref their items. Also, separating the GIL between interpreters may mean we'll need an allocator per interpreter. In that case the deallocation must happen relative to the interpreter where the object was allocated. -eric
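
A tiny illustration of the first point (plain single-interpreter Python, purely illustrative): __del__ can run arbitrary code that mutates other objects, so it must execute under the interpreter those objects belong to:

    class Logger:
        records = []  # an ordinary object living in the owning interpreter

    class Resource:
        def __del__(self):
            # Arbitrary code: mutates Logger.records, including the
            # increfs/decrefs that append() implies. Run under a
            # different interpreter, this would race with the owner.
            Logger.records.append("resource finalized")

    r = Resource()
    del r                  # refcount hits zero, __del__ fires
    print(Logger.records)  # ['resource finalized']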

All processors have thread-safe ways to increment, decrement, and test integers without holding a lock. That is the mechanism that locks themselves are built out of. You can use that to avoid holding the GIL until the ref count reaches 0. In C++ they built it into the language with std::atomic_int; you would have to find the way to do this in C, I don't have an answer at my fingertips for C. Barry
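
CPython exposes no public atomic integer, so here is only a semantic sketch of the inc/dec-and-test pattern Barry describes, with a lock standing in for the hardware fetch-and-add that std::atomic_int compiles down to:

    import threading

    class AtomicInt:
        """Lock-based stand-in; a real implementation would use the
        processor's atomic instructions, not a lock."""
        def __init__(self, value=0):
            self._value = value
            self._lock = threading.Lock()

        def inc(self):
            with self._lock:
                self._value += 1
                return self._value

        def dec_and_test(self):
            # True exactly when this call drops the count to zero:
            # the caller that sees True performs the deallocation.
            with self._lock:
                self._value -= 1
                return self._value == 0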

On 18 Jul 2018, at 08:02, Barry <barry@barrys-emacs.org> wrote:
Some past attempts at getting rid of the GIL used atomic inc/dec, and that resulted in bad performance because these instructions aren’t cheap. My gut feeling is that you’d have to get rid of refcounts to get high performance when getting rid of the GIL in a single interpreter, which would almost certainly result in breaking the C API. Ronald

Isn't this class of problem what leads to the per-processor caches and other optimisations in the Linux kernel? I wonder if kernel optimisations could be applied to this problem?
My gut feeling is that you’d have to get rid of refcounts to get high performance when getting rid of the GIL in a single interpreter, which would almost certainly result in breaking the C API.
Working on the ref count costs might be the enabling tech. We already have the problem of unchanging objects being copied after a fork, because updating the ref counts stored inside the object dirties the copy-on-write pages. It was suggested that the ref count would have to move out of the object to help with this problem. If there is a desirable solution to the parallel problem, we can think about the C API migration problem. Barry

On Wed, 18 Jul 2018 08:21:31 +0100 Ronald Oussoren via Python-ideas <python-ideas@python.org> wrote:
Some past attempts at getting rid of the GIL used atomic inc/dec, and that resulted in bad performance because these instructions aren’t cheap.
Please read in context: we are not talking about making all refcounts atomic, only a couple refcounts on shared objects (which probably won't be Python objects, actually). Regards Antoine.

Let me try a longer answer. The inc+test and dec+test do not require a lock if coded correctly. Every OS and runtime has solved this in order to provide locks; all processors provide the instructions that are the building blocks for lock primitives. You cannot mutate a mutable Python object that is not protected by the GIL, as the change of state involves multiple parts of the object changing. But if you know that an object is immutable, then you only need to manage its ref count, since you will never change the state of the object beyond that count. To access the object you only have to ensure it will not be deleted, which the ref count guarantees. The delete of the immutable object is then the only job that the original interpreter must do.
Yep that I understand. Barry
-eric

On Wed, Jul 18, 2018 at 1:37 AM Barry Scott <barry@barrys-emacs.org> wrote:
Perhaps we're agreeing? Other than the single decref when "releasing" the object, it won't ever be directly modified (even the refcount) in the other interpreter. In effect that interpreter holds a reference to the object which prevents GC in the "owning" interpreter (the corresponding incref happened in that original interpreter before the object was "shared"). The only issue is how to "release" the object in the other interpreter so that the decref happens in the "owning" interpreter. As earlier noted, I'm planning on taking advantage of the existing ceval "pending calls" machinery. So I'm not sure where an atomic int would factor in. If you mean switching the existing refcount to an atomic int for the sake of the cross-interpreter decref, then that's not going to happen, as Ronald suggested. Larry could tell you about his Gilectomy experience. :) Are you suggesting something like a second "cross-interpreter refcount", which would be atomic, and adding a check in Py_DECREF? That would imply an extra cross-interpreter-oriented C-API to parallel Py_DECREF. It would also mean either adding another field to PyObject (yikes!) or keeping a separate table for tracking cross-interpreter references. I'm not sure any of that would be better than the alternative I'm pursuing. Then again, I've considered tracking which interpreters hold a "reference" to an object, which isn't that different. -eric

Hi Eric, Antoine, all,

Antoine said that what I proposed earlier was very similar to what Eric is trying to do, but from the direction the discussion has taken so far that appears not to be the case. I will therefore try to clarify my proposal.

Basically, what I am suggesting is a direct translation of Javascript's Web Worker API (https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API) to Python. The Web Worker API is generally considered a "share-nothing" approach, although as we will see some state can be shared.

The basic principle is that any object lives in a single Worker (Worker = subinterpreter). If a message is sent from Worker A to Worker B, the message is not shared; rather, the so-called "structured clone" algorithm is used to recursively create a NEW message object in Worker B. This is roughly equivalent to pickling in A and then unpickling in B.

Of course, this may become a bottleneck if large amounts of data need to be communicated. Therefore, there is a special object type designed to provide a view upon a piece of shared memory: SharedArrayBuffer. Notably, this only provides a view upon raw "C"-style data (ints or floats or whatever), not on Javascript objects.

To translate this to the Python situation: each Python object is owned by a single subinterpreter, and may only be manipulated by a thread which holds the GIL of that particular subinterpreter. Message sending between subinterpreters will require the message objects to be "structured cloned".

Certain C extension types may override what structured cloning means for them. In particular, some C extension types may have a two-layer structure where the PyObject contains a refcounted pointer to the actual data. The structured cloning of such an object may create a second PyObject which references the same underlying object. This secondary refcount will need to be properly atomic, since it may be manipulated from multiple subinterpreters. In this way, interpreter-shared data structures can be implemented. However, all the "normal" Python objects are not shared and can continue to use the current, non-atomic refcounting implementation.

Hope this clarifies my proposal. Stephan

2018-07-18 19:58 GMT+02:00 Eric Snow <ericsnowcurrently@gmail.com>:
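
A minimal sketch of what such a translation might look like. Worker and post_message are hypothetical names borrowed from the Javascript API, pickle stands in for structured cloning, and a thread stands in for the subinterpreter a real design would use:

    import pickle
    import queue
    import threading
    import time

    class Worker:
        """Hypothetical share-nothing worker."""
        def __init__(self, on_message):
            self._inbox = queue.Queue()
            t = threading.Thread(target=self._loop, args=(on_message,),
                                 daemon=True)
            t.start()

        def post_message(self, msg):
            # "Structured clone": the receiver gets a recursively NEW
            # object, roughly pickle-in-A then unpickle-in-B. No live
            # object is ever shared between the two sides.
            self._inbox.put(pickle.dumps(msg))

        def _loop(self, on_message):
            while True:
                on_message(pickle.loads(self._inbox.get()))

    w = Worker(lambda msg: print("received a clone:", msg))
    w.post_message({"pixels": [1, 2, 3]})
    time.sleep(0.1)  # give the worker thread time to run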

Hi Python in the age of the multi-core processor is an important question. And garbage collection is one of the many issues involved. I've been thinking about the garbage collection problem, and lurking on this list, for a while. I think it's now about time I showed myself, and shared my thoughts. I intend to do this in a new thread, dealing only with the problem of multi-core reference counting garbage collection. I hope you don't mind my doing this. Expect the first instalment tomorrow. with best regards Jonathan

On Wed, Jul 18, 2018 at 12:49 PM Stephan Houben <stephanh42@gmail.com> wrote:
It looks like we are after the same thing actually. :) Sorry for any confusion. There are currently no provisions for actually sharing objects between interpreters. In fact, initially the plan is basically to support sharing copies of basic builtin immutable types. The question of refcounts comes in when we actually do share underlying data of immutable objects (e.g. the buffer protocol).
Yes, there's a strong parallel to that model here. In fact, I mentioned web workers in my language summit talk at PyCon 2018.
That is exactly what the channels in the PEP 554 implementation do, though much more efficiently than pickling. Initial support will be for basic builtin immutable types. We can later consider support for other (even arbitrary?) types, but anything beyond copying (e.g. pickle) is way off my radar. Python's C-API is so closely tied to refcounting that we simply cannot support safely sharing actual Python objects between interpreters once we no longer share the GIL between them.
Yep, that translates to buffers in Python, which is covered by PEP 554 (see SendChannel.send_buffer). In this case, where some underlying data is actually shared, the implementation has to deal with keeping a reference to the original object and releasing it when done, which is what all the talk of refcounts has been about. However, the PEP does not talk about it because it is an implementation detail that is not exposed in Python.
Correct. That is what PEP 554 does. As an aside, your phrasing "may only be manipulated by a thread which holds the GIL of that particular subinterpreter" did spark something I'll consider later: perhaps interpreters can acquire each other's GIL when (infrequently) necessary. That could simplify a few things.
My implementation of PEP 554 supports this, though I have not made the C-API for it public. It's also not part of the PEP. I was considering adding it.
That is correct. That entirely matches what I'm doing with PEP 554. In fact, the isolation between interpreters is critical to my multi-core Python project, of which PEP 554 is a part. It's necessary in order to stop sharing the GIL between interpreters. So actual objects will never be shared between interpreters. They can't be.
Hope this clarifies my proposal.
Yep. Thanks! -eric
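
For reference, here is roughly how the channel side of this looks against the draft PEP 554 API. The module is not importable yet, and the names (create, create_channel, send, recv, send_buffer) are taken from the PEP's draft text and may change before anything lands:

    import interpreters  # PEP 554 draft; not yet in the stdlib

    interp = interpreters.create()               # a new subinterpreter
    recv, send = interpreters.create_channel()   # (RecvChannel, SendChannel)

    # Basic builtin immutable types cross the channel as copies:
    send.send("hello")
    print(recv.recv())

    # Buffers share their underlying data instead of copying; the
    # implementation keeps the original object alive until the
    # receiving interpreter releases its view (the cross-interpreter
    # decref discussed earlier in the thread):
    send.send_buffer(bytearray(1024))

In a real program one end of the channel would be handed to the subinterpreter created above; the exact mechanism for doing that is part of what the PEP is still working out.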

On 2018-07-18 20:35, Eric Snow wrote:
What if an object is not going to be shared, but instead "moved" from one subinterpreter to another? The first subinterpreter would no longer have a reference to the object. If the object's refcount is 1 and the object doesn't refer to any other object, then copying would not be necessary. [snip]

On Wed, Jul 18, 2018 at 2:38 PM MRAB <python@mrabarnett.plus.com> wrote:
Yeah, that's something that I'm sure we'll investigate at some point, but it's not part of the short-term plans. This belongs to a whole class of possibilities that we'll explore once we have the basic functionality established. :) FWIW, I don't think that "moving" an object like this would be too hard to implement. -eric
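
Purely as a thought experiment, the decision could look something like this in Python terms. move_or_clone is a hypothetical helper; a real version would be C-level and would also check that the object holds no references to other objects:

    import pickle
    import sys

    def move_or_clone(obj):
        # sys.getrefcount reports one extra reference (its own
        # argument), so a count of 2 means: the caller's binding
        # plus ours, and nothing else.
        if sys.getrefcount(obj) <= 2:
            return obj  # sole reference: hand the object over as-is
        # Someone else still holds it: fall back to a copy
        # (pickle stands in for the channel's real cloning).
        return pickle.loads(pickle.dumps(obj))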

On Wed, Jul 18, 2018 at 11:49 AM, Stephan Houben <stephanh42@gmail.com> wrote:
Note that everything you said here also exactly describes the programming model for the existing 'multiprocessing' module: "structured clone" is equivalent to how multiprocessing uses pickle to transfer arbitrary objects, or you can use multiprocessing.Array to get a shared view on raw "C"-style data. -n -- Nathaniel J. Smith -- https://vorpus.org

Hi Nathaniel, 2018-07-19 1:33 GMT+02:00 Nathaniel Smith <njs@pobox.com>:
This is true. In fact, I am a big fan of multiprocessing and I think it is often overlooked/underrated. Experience with multiprocessing is also what has me convinced that a share-nothing or share-explicit approach to concurrency is a useful programming model.

The main limitation of multiprocessing comes when you need to go outside Python and interact with C/C++ libraries or operating system services from multiple processes. The support for this generally varies from "extremely weak" to "none at all". For example, things I would like to do in parallel with a main thread/process:

* Upload data to the GPU using OpenGL or OpenCL
* Generate a picture in a PyQt QImage, then hand it over zero-copy to the main thread
* Interact with a complex scenegraph in C++ (shared with the main thread)

This is impossible right now but would be possible if the interpreters were all in-process.

In addition, there are things which are now hard with multiprocessing but could be fixed. For example, sharing a Numpy array is possible but very inconvenient. You need to first allocate the raw data segment, communicate that, then create in each process an array which uses this data segment (see the sketch below). Ideally, this would rather work like this:

ar = numpy.zeros((30, 30), shared=True)

and then "ar" would automatically be shared. This is fixable, but given the other limitations above the question is whether it is worthwhile to fix it now. It would be a lot simpler to fix if we had the in-process model.

But yeah, I am actually also very open to ideas on how multiprocessing could be made more convenient and powerful. Perhaps there are ways, and I am just not seeing them. Stephan
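
For reference, this is roughly the dance being described, using only stdlib multiprocessing and numpy as they exist today (the shared=True spelling above is the hypothetical improvement):

    import multiprocessing as mp
    import numpy as np

    def worker(shared_arr):
        # Re-wrap the raw shared segment as an ndarray in the child.
        ar = np.frombuffer(shared_arr.get_obj()).reshape(30, 30)
        ar[0, 0] = 42.0  # visible to the parent: same memory

    if __name__ == "__main__":
        # 1. Allocate the raw data segment (900 C doubles, zeroed).
        shared_arr = mp.Array("d", 30 * 30)
        # 2. Communicate it to the child process.
        p = mp.Process(target=worker, args=(shared_arr,))
        p.start()
        p.join()
        # 3. Create an array view over the same segment in the parent.
        ar = np.frombuffer(shared_arr.get_obj()).reshape(30, 30)
        print(ar[0, 0])  # 42.0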

On 18 July 2018 at 05:35, Eric Snow <ericsnowcurrently@gmail.com> wrote:
Aw, I guess the original idea of just doing an active interpreter context switch in the current thread around the shared object decref operation didn't work out? That's a shame. I'd be curious as to the technical details of what actually failed in that approach, as I would have expected it to at least work, even if the performance might not have been wonderful. (Although thinking about it further now given a per-interpreter locking model, I suspect there could be some wonderful opportunities for cross-interpreter deadlocks that we didn't consider in our initial design sketch...) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Eric,
How does the proxy at the same time make the object accessible and prevent mutation? Would it help if there was an explicit owning thread for each object? I'm thinking that you can do a fast check that the object belongs to the current thread, and use that knowledge to avoid locking. If the object is owned by another thread, acquire the GIL in the traditional way, and mutating the state will be safe. The "sharing" process can ensure that, until an explicit "unsharing", the object remains safe to access in all threads that share it, avoiding the need for the special processor instructions. Barry
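
A sketch of that fast-path check in Python terms. The real test would be a cheap pointer comparison in C, and the per-object lock below is only a stand-in for whatever the cross-interpreter locking turns out to be:

    import threading
    from contextlib import contextmanager, nullcontext

    class Owned:
        """Object with an explicit owning thread (hypothetical model)."""
        def __init__(self, obj):
            self.obj = obj
            self.owner = threading.get_ident()
            self.lock = threading.Lock()  # stands in for "the GIL"

        def access(self):
            # Fast path: the owning thread needs no locking at all.
            if threading.get_ident() == self.owner:
                return nullcontext(self.obj)
            # Slow path: any other thread takes the lock first.
            return self._locked()

        @contextmanager
        def _locked(self):
            with self.lock:
                yield self.obj

    # usage, from any thread:
    #     with shared.access() as obj:
    #         obj.append(1)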

MRAB wrote:
What about other objects accessed through the shared object? They would need to get wrapped in proxies too. Also, if the shared object is mutable, changes to it would need to be protected by a lock of some kind. Maybe all this could be taken care of by the proxy objects, but it seems like it would be quite tricky to get right. -- Greg

On Sun, 15 Jul 2018 20:21:56 -0700 Nathaniel Smith <njs@pobox.com> wrote:
It's not that it's impossible; it's that everyone who tried to remove it ended up with a 30-40% slowdown in single-threaded mode (*). Perhaps Larry manages to do better, though ;-) (*) a figure which I assume is highly workload-dependent Regards Antoine.