[Python-ideas] Multi-core reference count garbage collection

Wed Jul 25 06:26:54 EDT 2018

Hi All

INTRODUCTION

This is the third and concluding post, where I describe a scheme for
multi-core reference counting garbage collection. The first two posts
are
https://mail.python.org/pipermail/python-ideas/2018-July/052054.html
https://mail.python.org/pipermail/python-ideas/2018-July/052151.html

This subject is quite technical, with pitfalls and a race hazard. If
you don't understand what I've written, or think it wrong, it might be
that I screwed up and got it wrong. I personally find it a hard
subject. Or it might be that you've missed something in one of the two
previous posts. Or we might both be wrong. Or both right!

For the race hazard, see
https://mail.python.org/pipermail/python-ideas/2018-July/052190.html

There have been many valuable contributions to this discussion thread,
which I will acknowledge later, in another post. However, I do repeat
that this scheme is based on other people's work (particularly talks
by Larry Hastings), as well as my own thinking.

And it is actually a known and published idea (which is reassuring).
https://mail.python.org/pipermail/python-ideas/2018-July/052061.html

THE STORY SO FAR

We have 5 worker processes and one garbage collection (GC) process.
Each worker process has an INCR buffer in which it logs an ID, every
time the process increases the refcount of the object with that ID.
And similarly for DECR.

When the worker INCR and DECR buffers are full, they are passed over
to the GC process. The GC process keeps, for each object ID, a running
total of how many references. In this way it consumes the information
in the INCR and DECR buffers.

The problem, as always in garbage collection, is to reclaim objects
that will no longer be used. With reference counting, objects whose
reference count is zero can be reclaimed. This is often done, in
single process systems, immediately after the reference count becomes
zero. (An important aside: For multi-core machines this is an
unhelpful constraint. Relaxing it will allow the worker processes to
run more freely.)

At each point in time, there are for each object ID two reference
counts. The first is the running total, updated as full INCR and DECR
buffers are applied to it. I will call this the RAW TOTAL. The second
is the raw total, adjusted by all the INCR and DECR buffers in the
worker processes. I will call this the REAL TOTAL.

Once the real total is zero, it stays zero. This is an immediate
consequence of the worker threads having no references to the object.
Having no references, they cannot do an INCR or DECR. When the real
total is zero, the object can be reclaimed.

The problem is to refine what we have, so that all objects whose real
count is zero are in good time reclaimed. We want to avoid locking all
the worker threads, for example to compute real counts for all IDs.

BEING TRICKY

Think of the worker threads as an adversary, who are trying to delay
or fool the garbage collector.

Here's one thing that can be done. Have a worker thread
1. Acquire (references to) some objects (say from other worker processes).
2. Delete those object references, without filling either the INCR or
DECR buffer.
3. Spin its wheels endlessly.

This will defeat the GC process, unless the system from time to time
sends the INCR and DECR buffers to the GC process, whether or not they
are full. So we'll add this to the system requirements.

Here's something else, which is part of normal use. We have one worker
thread do a calculation for another. The return value is passed from
one thread to another. It could happen that, for some ID, an INCR in
one thread is followed by a DECR in another. Now suppose that the GC
process gets
* First, the process buffer containing the DECR.
* And then the process buffer that contains the INCR.
(This is a race hazard. The order of operations has been reversed.)

Thus it can happen that the raw total for an ID can go down to zero
and then increase up to one. This can't happen for the true total, of
course. (With more work, the raw counts can go negative!)

DEFEATING TRICKINESS

The problem is for the GC process to discover IDs for which the real
total is zero. And in good time. Here's how we'll do it. Suppose an ID
has raw total zero. That makes it a candidate ID.

Now do a global INCR update. By this we mean that we have the GC
thread get INCR buffers from all the worker threads, and applies them
to the raw totals.

If the raw total for the candidate ID (drum roll) is still zero, then
the real total is also zero.

That's it. The rest is implementation.

Let's see why this is. First, note that we don't need to know exactly
when the real count became zero. We just need to know that it is now
zero. (And once zero, it stays zero.)

Now note that the real count can never be negative. Now suppose at
time T the raw count (for an ID) is zero, but that the real count is
one or more. That means that some worker process has a reference, and
so the ID appears in some INCR buffer.

Well, we just did a global INCR update, and found that the candidate
ID was still zero. So the ID wasn't in an INCR buffer. We can reclaim
the memory.

A MATHEMATICAL APPROACH

The previous argument was perhaps a computer science approach. Here's
a more mathematical approach.

We are using a special property of global INCR update. Suppose that at
time T the raw count is N. Now do a global INCR update, to get raw
count M. Let R be the real count at time T.

We have
* N <= M, because processing only INCR buffers.
* R <= M, because
** we skipped the DECR buffers
** and at most enlarged the INCR buffers

We always have 0 <= R, and after the global INCR update we have R <=
M. That was at our time T, when we started the global INCR update. But
once zero, R stays zero.

So M == 0 is our condition for garbage collection, when M is the value
after a global INCR update.

WHERE NOW

For me, this thread has been an exploration. I'm now convinced,
although you may not be, that buffered multi-core reference counting
garbage collections is something that can be done. And that it will
allow the worker processes to run freely, at least when it comes to
garbage collection.

This discussion thread contributes nothing to important problem of
multiple processes mutating (writing to) the same object. (Except that
buffering provides a solution in the special case where order doesn't
matter, such 1 + 2 + 3 = 2 + 3 + 1.)

Well done for getting to the end of this long and difficult material.
I've written it with extra special care, but expect that it can still
be made clearer, and might have errors.

I'd be delighted if this discussion thread helps us make CPython run
faster on multi-core machines, at least for some users. If you've got
some ideas about that, I'd love to hear them.

Questions and comments are, of course, welcome. Especially corrections
and improvements.

-- 
Jonathan