Multi-core reference count garbage collection

Hi

Based on other people's work (including in particular talks by Larry Hastings) and my own thinking, I've come up with a scheme for multi-core reference count garbage collection. I think it works, and much or all of it comes from others. It might even be in the literature. Of course, if it's wrong then the debit goes to me. I'll explain it in instalments. Please let me know what you think, as we go along.

The basic ideas of reference counting garbage collection are:
1. Each allocated piece of memory is given an ID (which is often its address in memory).
2. For each ID, the system keeps a count of how many references there are to that piece of memory.
3. The count starts at ONE.
4. Processes increment and decrement the count, to keep the count correct.
5. When the count reaches zero, the piece of memory is garbage collected.
6. The previous step might result in further reference count decrements.

The simplest possible garbage collection scheme is to do nothing (and let the garbage build up). In other words, allocate on a stack, and never free memory. This has very good performance, until memory is exhausted. It is also conceptually useful. I'll call it the do-nothing garbage collection scheme.

Suppose for definiteness we have a 6-core CPU, with 5 worker processes and one garbage collection (GC) process. The worker processes will do as little as possible, consistent with not running out of memory. To achieve this:
1. Each worker process will have two GC buffers, one for increments and the other for decrements.
2. For an increment, the process will append the ID to its increment buffer. And similarly for a decrement.
3. If the buffer to be appended to is full, the worker process will
   (a) set a buffer-full flag,
   (b) pause until the buffer-full flag has been cleared,
   (c) and then do what it was previously blocked from doing.

I call any such scheme BUFFERED multi-core reference count garbage collection. The worker processes know nothing about how garbage collection is managed. Instead, they pass over to the GC process sufficient information to allow it to manage the garbage.

This is now a good place to pause. We've specified what the worker processes should do, regarding garbage collection. And we've passed the problem on to the garbage collection process. If that process does nothing, we have refactored the do-nothing garbage collection scheme.

Expect another instalment tomorrow.

with best regards

Jonathan
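[Editorial note: to make the worker-side bookkeeping concrete, here is a minimal Python sketch of the buffering just described. It is illustrative only: BUFFER_SIZE and send_to_gc are hypothetical names, and a real implementation would use shared memory and OS-level synchronisation rather than ordinary lists and a blocking call.]

    BUFFER_SIZE = 1024  # hypothetical capacity

    class WorkerGCBuffers:
        """Per-worker INCR/DECR buffers, as described above (a sketch)."""

        def __init__(self, send_to_gc):
            self.incr = []                # IDs whose refcount this worker raised
            self.decr = []                # IDs whose refcount this worker lowered
            self.send_to_gc = send_to_gc  # blocking hand-off to the GC process

        def log_incr(self, obj_id):
            self._append(self.incr, 'incr', obj_id)

        def log_decr(self, obj_id):
            self._append(self.decr, 'decr', obj_id)

        def _append(self, buf, kind, obj_id):
            if len(buf) >= BUFFER_SIZE:
                # Buffer full: this blocking call stands in for setting the
                # buffer-full flag and pausing until it has been cleared.
                self.send_to_gc(kind, list(buf))
                buf.clear()
            buf.append(obj_id)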

Hi Jonathan,

2018-07-19 8:33 GMT+02:00 Jonathan Fine <jfine2358@gmail.com>:
This is actually a known idea (which is GOOD, unless you wanted to apply for a patent ;-) ). It is described, among other places, in "The Garbage Collection Handbook: The Art of Automatic Memory Management", by Richard Jones, Antony Hosking and Eliot Moss. In fact, they also call it buffered reference counting. Section 18.2, Buffered Reference Counting: "...DeTreville [1990] had mutators log the old and new referents of each pointer update to a buffer".

By the way, this book describes a ton of other ideas to speed up reference counting in general and in the concurrent case, so it may be worthwhile to get it. I should warn you that many of these advanced refcounting ideas have been proposed in the past already, although I am unaware of this particular technique having been implemented.

Stephan

Hi Stephan

Thank you for the extract from the GC Handbook, which I think I may have seen before. Yes, it is GOOD that it's an already known idea.

Searching for "buffered reference counting" I found
https://mail.python.org/pipermail/python-dev/2016-October/146696.html
in which Larry Hastings says that CPython "plays games with reference counts", which makes implementing "buffered reference counting" harder. And he gives examples.

Larry also writes [loc. cit.] about resurrecting objects. I don't know what he means by this. It may be something to do with weak references. Larry's post gives some details, in which difficulties may lie. In particular, he writes:
===
https://mail.python.org/pipermail/python-dev/2016-October/146604.html
It's my contention that this API [for weak references] is simply untenable under the Gilectomy, and that it needs to change to returning a new (strong) reference.
===

However, I'm focused on the larger picture right now. And it is good to know something of where the hazards lie. Once again, Stephan, thank you for your contribution to this thread.

On 2018-07-19 11:53, Jonathan Fine wrote:
[snip] When an object's refcount drops to 0, the object's __del__ method (if defined) is called, and then the object's memory can be reclaimed. But what if the __del__ method creates a new reference to the object? The object's refcount is no longer 0, so the object is no longer garbage. That's called "resurrecting an object".
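[Editorial note: for concreteness, here is a minimal runnable example of the resurrection MRAB describes; the global name resurrected is just a convenient place to store the new reference.]

    resurrected = None

    class Phoenix:
        def __del__(self):
            # Create a new reference to self: the refcount is back above
            # zero, so the object is no longer garbage -- it is resurrected.
            global resurrected
            resurrected = self

    p = Phoenix()
    del p                # refcount drops to 0 and __del__ runs...
    print(resurrected)   # ...but the object is still alive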

In an earlier post, I defined BUFFERED multi-core reference count garbage collection. The basic idea is that each worker process stores in buffers the need to do an INCR or DECR on the reference count for a piece of memory with a given ID. There is then a garbage collection process that does the right thing.
https://mail.python.org/pipermail/python-ideas/2018-July/052054.html

An important aside. I'm very grateful for all the comments, which are very helpful, made to the first part. It's really nice to know that mine is not a new idea, and about some of the features and quirks (or optimisations, if you like) in the CPython GC system.

Here, I describe the easy part of doing the right thing. And make a start on the hard part.

EASY PART
The GC process is responsible for keeping the running totals of the reference counts. It does this by processing the INCR and DECR buffers that are passed to it. (Once these buffers are processed they can, of course, be passed back to the system for reuse.)

Where to store the totals of the reference counts is a design decision. The main constraint is that ONLY the GC process should be changing these counts. The buffering is roughly equivalent to the changes made by the worker processes being queued for the GC process to perform. But buffering also allows the 'queue' to be reordered. This can substantially reduce the number of system calls required to coordinate the various processes, and the amount of time a process is locked out from running.

We could, following CPython, store the reference counts as a system field in the object being stored. This is probably most economical regarding use of memory. However, if the GC process has associated to it some high-speed memory of its own, it will be quicker, if possible, to store the reference count totals in that high-speed memory. And so storing the counts separately will probably make better use of process-specific high-speed memory. Either way, the algorithm is easy.

HARD PART
The do-nothing garbage collection avoids the hard part, which is to know when a piece of memory can be reclaimed. With single-core reference counting, this would be when the count becomes zero. But is the same true for multi-core garbage collection? The answer is both YES and NO.

First for the NO. The buffered scheme described here is asynchronous, and in particular reordering. So there might be INCR operations pending that would take the global reference count from zero to one. (And it can go the other way. The object might be inaccessible, but not all of its DECR operations processed yet.) So, and this is important, we cannot rely on the reference count totals owned by the GC process to tell us if there is or is not a reference to the object in the worker threads.

Now for the YES. The operations for the processes can be put in a definite temporal order (i.e. serialized). When this is done then (assuming every process is behaving as it should) the following statement is true:

Suppose at some point T in time the quantity
* the GC process reference count for the object with some ID
* plus any pending INCR operations in worker process buffers
* minus any similarly pending DECR operations
is zero. Call this the adjusted reference count.

If that is true at T, then after time T the object with the given ID is inaccessible to the worker processes. And so the piece of memory can be reclaimed.

We can, of course, compute the adjusted reference count by pausing all worker threads. In other words, provide and use a global interpreter lock.
The remainder of the hard part is to get hold of adjusted reference counts, without having to lock the worker threads. As before, remarks and comments are very welcome. The previous ones were much appreciated. -- Jonathan
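[Editorial note: a minimal sketch of the "easy part" described above — the GC process consuming INCR and DECR buffers and maintaining the running totals (called the "raw totals" in the next instalment). Storing the totals in a plain dict keyed by ID is just one of the design options mentioned; the names are hypothetical.]

    from collections import defaultdict

    class GCProcess:
        """Owns the running reference count totals (a sketch)."""

        def __init__(self):
            self.raw_total = defaultdict(int)   # object ID -> running total

        def apply_incr_buffer(self, buf):
            for obj_id in buf:
                self.raw_total[obj_id] += 1

        def apply_decr_buffer(self, buf):
            for obj_id in buf:
                self.raw_total[obj_id] -= 1
                # A zero here is NOT yet proof of garbage: INCR operations
                # may still be pending in worker buffers, and reordered
                # buffers can even drive this total negative.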

Hi All

INTRODUCTION
This is the third and concluding post, where I describe a scheme for multi-core reference counting garbage collection. The first two posts are
https://mail.python.org/pipermail/python-ideas/2018-July/052054.html
https://mail.python.org/pipermail/python-ideas/2018-July/052151.html

This subject is quite technical, with pitfalls and a race hazard. If you don't understand what I've written, or think it wrong, it might be that I screwed up and got it wrong. I personally find it a hard subject. Or it might be that you've missed something in one of the two previous posts. Or we might both be wrong. Or both right! For the race hazard, see
https://mail.python.org/pipermail/python-ideas/2018-July/052190.html

There have been many valuable contributions to this discussion thread, which I will acknowledge later, in another post. However, I do repeat that this scheme is based on other people's work (particularly talks by Larry Hastings), as well as my own thinking. And it is actually a known and published idea (which is reassuring).
https://mail.python.org/pipermail/python-ideas/2018-July/052061.html

THE STORY SO FAR
We have 5 worker processes and one garbage collection (GC) process. Each worker process has an INCR buffer, in which it logs an ID every time the process increases the refcount of the object with that ID. And similarly for DECR. When the worker INCR and DECR buffers are full, they are passed over to the GC process. The GC process keeps, for each object ID, a running total of how many references there are. In this way it consumes the information in the INCR and DECR buffers.

The problem, as always in garbage collection, is to reclaim objects that will no longer be used. With reference counting, objects whose reference count is zero can be reclaimed. This is often done, in single process systems, immediately after the reference count becomes zero. (An important aside: for multi-core machines this is an unhelpful constraint. Relaxing it will allow the worker processes to run more freely.)

At each point in time, there are for each object ID two reference counts. The first is the running total, updated as full INCR and DECR buffers are applied to it. I will call this the RAW TOTAL. The second is the raw total, adjusted by all the INCR and DECR buffers in the worker processes. I will call this the REAL TOTAL.

Once the real total is zero, it stays zero. This is an immediate consequence of the worker threads having no references to the object. Having no references, they cannot do an INCR or DECR. When the real total is zero, the object can be reclaimed.

The problem is to refine what we have, so that all objects whose real total is zero are in good time reclaimed. We want to avoid locking all the worker threads, for example to compute real totals for all IDs.

BEING TRICKY
Think of the worker threads as an adversary, trying to delay or fool the garbage collector. Here's one thing that can be done. Have a worker thread
1. Acquire (references to) some objects (say from other worker processes).
2. Delete those object references, without filling either the INCR or DECR buffer.
3. Spin its wheels endlessly.

This will defeat the GC process, unless the system from time to time sends the INCR and DECR buffers to the GC process, whether or not they are full. So we'll add this to the system requirements.

Here's something else, which is part of normal use. We have one worker thread do a calculation for another. The return value is passed from one thread to another.
It could happen that, for some ID, an INCR in one thread is followed by a DECR in another. Now suppose that the GC process gets
* first, the process buffer containing the DECR,
* and then the process buffer that contains the INCR.
(This is a race hazard. The order of operations has been reversed.) Thus it can happen that the raw total for an ID goes down to zero and then increases back up to one. This can't happen for the real total, of course. (With more work, the raw counts can even go negative!)

DEFEATING TRICKINESS
The problem is for the GC process to discover IDs for which the real total is zero. And in good time. Here's how we'll do it.

Suppose an ID has raw total zero. That makes it a candidate ID. Now do a global INCR update. By this we mean that the GC thread gets INCR buffers from all the worker threads, and applies them to the raw totals. If the raw total for the candidate ID (drum roll) is still zero, then the real total is also zero. That's it. The rest is implementation.

Let's see why this is. First, note that we don't need to know exactly when the real total became zero. We just need to know that it is now zero. (And once zero, it stays zero.) Now note that the real total can never be negative. Now suppose at time T the raw total (for an ID) is zero, but the real total is one or more. That means that some worker process has a reference, and so the ID appears in some INCR buffer. Well, we just did a global INCR update, and found that the candidate ID was still at zero. So the ID wasn't in an INCR buffer. We can reclaim the memory.

A MATHEMATICAL APPROACH
The previous argument was perhaps a computer science approach. Here's a more mathematical approach. We are using a special property of the global INCR update. Suppose that at time T the raw total is N. Now do a global INCR update, to get raw total M. Let R be the real total at time T. We have
* N <= M, because we processed only INCR buffers;
* R <= M, because we skipped the DECR buffers, and at most enlarged the INCR buffers.

We always have 0 <= R, and after the global INCR update we have R <= M. That was at our time T, when we started the global INCR update. But once zero, R stays zero. So M == 0 is our condition for garbage collection, where M is the value after a global INCR update.

WHERE NOW
For me, this thread has been an exploration. I'm now convinced, although you may not be, that buffered multi-core reference counting garbage collection is something that can be done. And that it will allow the worker processes to run freely, at least when it comes to garbage collection.

This discussion thread contributes nothing to the important problem of multiple processes mutating (writing to) the same object. (Except that buffering provides a solution in the special case where order doesn't matter, such as 1 + 2 + 3 = 2 + 3 + 1.)

Well done for getting to the end of this long and difficult material. I've written it with extra special care, but expect that it can still be made clearer, and might have errors. I'd be delighted if this discussion thread helps us make CPython run faster on multi-core machines, at least for some users. If you've got some ideas about that, I'd love to hear them.

Questions and comments are, of course, welcome. Especially corrections and improvements. -- Jonathan
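[Editorial note: here, as a summary, is a hedged sketch of the reclamation test just described. take_incr_buffer is a hypothetical atomic hand-off of a worker's INCR buffer, and gc is a GCProcess as sketched after the previous instalment.]

    def reclaimable_ids(gc, workers):
        """Return the IDs whose real total is provably zero (a sketch)."""
        # Step 1: IDs whose raw total is zero are only candidates.
        candidates = {i for i, n in gc.raw_total.items() if n == 0}
        if not candidates:
            return set()
        # Step 2: global INCR update -- drain every worker's INCR buffer
        # (but NOT the DECR buffers) into the raw totals.
        for w in workers:
            gc.apply_incr_buffer(w.take_incr_buffer())
        # Step 3: a candidate still at zero has real total zero, and once
        # the real total is zero it stays zero, so it can be reclaimed.
        return {i for i in candidates if gc.raw_total[i] == 0}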

Hi Steve

You wrote:
The problem I'm trying to solve is doing multi-core reference counting garbage collection without a GIL. As far as I know, we're not yet able to do efficient multi-core garbage collection using Python. If anyone can tell me that's already been solved, I'd be delighted.

Stephan Houben's post tells us that buffered reference counting appears in the major reference work "The Garbage Collection Handbook: The Art of Automatic Memory Management", by Richard Jones, Antony Hosking and Eliot Moss. So it might be a good idea.

Please let me know if you'd like further information. -- Jonathan

On Fri, Jul 20, 2018 at 06:37:49PM +0100, Jonathan Fine wrote:
Perhaps I'm being a bit slow tonight, but that sounds like you're just repeating the solution: "multi-core reference counting garbage collection without a GIL". What problem does that solve?

If the quote-unquote "problem" is "get rid of the GIL", there are already at least two implementations of Python without a GIL (IronPython and Jython). Do they solve your problem? If there's some other problem you are trying to solve, you should explain what it is.
As far as I know, we're not yet able to do efficient multi-core garbage collection using Python.
Why would that be an advantage? My understanding is that reference counting is both deterministic and immediate. Shifting the reference counting into another thread so that it becomes non-deterministic and potentially delayed doesn't sound like an advantage to my naive understanding. But then I've never been one to think that threads are the only, or even the best, form of concurrency.

I've never understood why so many people care about the GIL. (Perhaps I'm jaundiced from dealing with too many people who honestly believe that the GIL is why their single-threaded code runs slowly. My favourite was the fellow I knew who was so convinced that Python was slow because of the GIL that he moved to Ruby, which also has a GIL and is slower than Python.)

-- Steve

Hi Steve

You wrote:
The choice is not as simple as that. Here's an example. Suppose you and your friend want to sort a deck of cards. Here's an algorithm.
* Divide the deck into two.
* Each person sorts their half deck.
* One of you does a merge of the two half decks.
This algorithm is non-deterministic. But for a large deck it's quicker than any deterministic algorithm. I hope this helps. -- Jonathan

On Sat, Jul 21, 2018 at 6:44 AM, Jonathan Fine <jfine2358@gmail.com> wrote:
Divide the deck into "first half" and "second half". Then sort the half decks using a stable algorithm. Finally, merge the decks, favouring cards from the first half. Voila! You now have a fully deterministic and stable sort. ChrisA
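[Editorial note: a small sketch of Chris's construction. The merge below takes from the first half on ties (the <= comparison), which is what makes the overall sort fully deterministic and stable.]

    def merge_favouring_first(first, second):
        """Merge two sorted lists, preferring `first` on ties."""
        out, i, j = [], 0, 0
        while i < len(first) and j < len(second):
            if first[i] <= second[j]:   # <= favours the first half
                out.append(first[i]); i += 1
            else:
                out.append(second[j]); j += 1
        return out + first[i:] + second[j:]

    deck = [3, 1, 4, 1, 5, 9, 2, 6]
    mid = len(deck) // 2
    halves = sorted(deck[:mid]), sorted(deck[mid:])   # sort independently
    assert merge_favouring_first(*halves) == sorted(deck)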

Hi Chris

Thank you for your message about two processes together sorting a deck of cards. My example was in response to a comment from Steve D'Aprano. He understood it in the way it was intended, which was that the algorithm is non-deterministic in execution, but that the outcome is deterministic. Steve, in a message after yours, explains this point. And so I won't repeat what would be essentially the same explanation. Once again, I thank you (and Steve) for your contributions. -- Jonathan

On Fri, Jul 20, 2018 at 09:44:39PM +0100, Jonathan Fine wrote:
Jonathan, you keep avoiding my direct questions. What problem are you trying to solve? It's okay if there is no immediate problem, that you're just exploring alternative garbage collection strategies. Or if you're trying to remove the GIL. Or you think that this will make print faster. (Probably not the last one *wink*)

And please don't patronise me -- I might not be an expert on garbage collection, but I know that parallelising certain tasks can make them run faster. Your sorting example is irrelevant to understanding this reference counting proposal. Help us understand where you are coming from in this discussion.
No, in any important way it isn't. Although the sub-sorting steps may run in any order, the final merge doesn't occur until they are both completed. There are no race conditions where the merge happens before the sort, for example (unless your implementation is *really* bad). Which means from the perspective of the outsider the overall sort is deterministic: you start with an unsorted list, call sort, and when the sort() function returns, you have a sorted list. The non-determinism is no more important than the random partition in Quicksort.

But in the case of reference counting, I understand that multithreaded reference counting can suffer from race conditions when incrementing or decrementing:
http://flyingfrogblog.blogspot.com/2013/09/how-do-reference-counting-and-tra...
That would be a bad thing. But as I said, my view of multi-threaded reference counting is very naive, and I'm no expert.

I'm less interested in the technical details of how and whether it works than in *why* you are interested in trying it. Hence my repeated requests for you to explain what problem(s) (if any) you are hoping to solve.

-- Steve

Hi Steve

Thank you for your message. I think my response below allows us to move forward.

WHAT'S THE PROBLEM
You asked:
I'm exploring. I hope we'll find something that helps CPython run faster on multi-core machines, at least for some users. I hope this helps you understand where I'm coming from.

Please, in this thread, can we confine ourselves to getting a better understanding of multi-core reference count garbage collection? That is for me already hard enough, in isolation, without mixing in other issues.

RACE CONDITION
You wrote:
http://flyingfrogblog.blogspot.com/2013/09/how-do-reference-counting-and-tra...

The relevant statement on that page is:
multithreaded reference counting is non-deterministic because increments and decrements race.
According to https://en.wikipedia.org/wiki/Race_condition
---
A race condition or race hazard is the behavior of an electronics, software, or other system where the output is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when events do not happen in the order the programmer intended.
---
I like this, but I would change the last sentence to read:
DEALING WITH THE RACE CONDITION
There is a race condition (or hazard) in buffered reference counting garbage collection. You wrote that this "would be a bad thing". I hope that, at the expense of some complexity in the code, the potential for a bug arising from the race hazard can be avoided. Certainly, a bug in garbage collection is a bad thing.

I was aware of this race hazard when I started this discussion thread. In my first post of yesterday I made "a start on the hard part" of "doing the right thing". I also then said:
The remainder of the hard part is to get hold of adjusted reference counts, without having to lock the worker threads.
GOING FORWARD
I hope in the next day or two to make a start on the remainder of the hard part. Your link to flyingfrog is valuable. I hope you'll contribute again. Finally, it might help you to know that I'm trying in my exploration to use top-down or stepwise design. -- Jonathan

I think that the problem that needs a solution is providing each thread with the knowledge that it is safe to update state. The questions that any solution needs to answer are then:

1) How does the solution provide each thread with the knowledge that it is safe to mutate state?
2) How does the solution manage the lifetime of the objects?

For 1 today: in Python all state is shared between all threads. One lock, the GIL, provides the code with the knowledge that it is safe to mutate state.

For 2 today: the reference counts track the number of uses of an object. In the simple case, when the ref count reaches 0, run the __del__ protocol. In the cases where circular reference patterns prevent detecting that an object is no longer needed via reference counts, the existing Python garbage collector figures out that such objects can be deleted. I'd note that the reference counts are just one piece of state of many pieces of state in an object.

As an example of an idea that is looking directly at these questions, the work on using a sub-interpreter for each thread is interesting. It is aiming to solve (1) by using a per-interpreter lock, which will allow each thread to run at full speed. I'm following with interest how the lifetime management will work. In multiprocessing, (1) is solved by no longer sharing object state.
Please, in this thread, can we confine ourselves to getting a better understanding of multi-core reference count garbage collection. That is for me already hard enough, in isolation, without mixing in other issues.
I'd argue that the ref counts are not interesting at all, only a side effect of one possible solution to the object life time problem. Barry

Hi Barry

We've met before. Nice to meet you again, this time electronically. You suggested that it is a different problem that needs a solution. To help maintain focus, I won't respond to that now. You wrote:
I'd argue that the ref counts are not interesting at all, only a side effect of one possible solution to the object life time problem.
I'm happy for you to regard multi-core reference counting (MCRC) as a toy problem, which won't become part of useful software. We can learn a lot by understanding a good toy. Einstein was fascinated by the compass, and the spinning top is surprisingly deep.
https://www.johnshepler.com/articles/einstein.html
https://www.einsteinscompass.com/
https://en.wikipedia.org/wiki/Gyrocompass
Of course, I hope you won't mind if MCRC turns out to be useful. -- Jonathan

Perhaps the point was that reference counting is only one of the issues (race conditions?) the GIL solves. So a GIL-free reference counting scheme may not prove to be helpful. If that were it, we could (optionally) turn off reference counting in multi-threaded code, and run a garbage collector once in a while instead. After all, CPython's (mostly) deterministic garbage collection is not guaranteed by the language standard. -CHB

participants (7)
- Barry Scott
- Chris Angelico
- Chris Barker - NOAA Federal
- Jonathan Fine
- MRAB
- Stephan Houben
- Steven D'Aprano