Deferred, coalescing, and other very recent reference counting optimizations
In CPython we have reference counting. My question is: can we optimize the current RC using strategies like deferred RC and coalescing? If not, where would I face problems if I tried to implement these sorts of strategies? These strategies all depend on the idea that we don't need the exact value of the reference count all the time. So far, in my observation, we only need the exact value before running the cycle collector. If we can make sure that we have the exact value before entering the cycle collector, then in my opinion we can add these optimization strategies to some extent. Is there something that I am missing? Or is it quite possible? If it is not possible, please tell me the factors I should consider. Thanks in advance.
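For what it's worth, the coalescing idea in the question can be sketched in a few lines of Python. This is a toy model only (the class and method names are hypothetical, not CPython API): increment/decrement pairs are buffered so that matched pairs cancel each other out, and exact counts are materialized only at a "safe point", e.g. just before a cycle-collector run:

```python
from collections import defaultdict

class CoalescingRC:
    """Toy model of coalescing RC; not how CPython works today."""

    def __init__(self):
        self.counts = defaultdict(int)   # exact counts, valid only at safe points
        self.pending = defaultdict(int)  # buffered deltas since the last flush

    def incref(self, obj):
        self.pending[obj] += 1           # no touch of the real count

    def decref(self, obj):
        self.pending[obj] -= 1

    def flush(self):
        """Apply net deltas at a safe point; return objects whose count hit zero."""
        freed = []
        for obj, delta in self.pending.items():
            if delta == 0:
                continue                 # matched inc/dec pairs coalesce away
            self.counts[obj] += delta
            if self.counts[obj] == 0:
                freed.append(obj)
                del self.counts[obj]
        self.pending.clear()
        return freed
```

A real implementation would additionally have to bound the buffer and guarantee the true count can never be observed too low between flushes, which is exactly where the difficulty lies.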
Hi, On Tue, 01 Sep 2020 02:52:07 -0000 "Raihan Rasheed Apurbo" <apurbo97@gmail.com> wrote:
In CPython we have reference counting. My question is: can we optimize the current RC using strategies like deferred RC and coalescing? If not, where would I face problems if I tried to implement these sorts of strategies?
You've already asked your question on python-ideas and already got answers there. In any case, the only definitive answer to "would I face problems" can be obtained by trying it out. So, if you've got some time on your hands, I would encourage you to experiment and share your results here. Regards Antoine.
Before experimenting, can't I ask someone whether relevant experiments have been made or not? If I am not supposed to do that, then I totally agree with you, sir. I tried asking on python-ideas, and one of the replies was: "I gather there have been some experiments along these lines as part of efforts to remove the GIL. You might like to research them to find out how much success they've had." What I can take from this reply is that there are some experiments on this topic, but I don't know where they are. I then asked where I could find them but didn't get an answer. So how am I supposed to get my answers? I am sorry if I am asking an inappropriate question here.
On Tue, Sep 1, 2020 at 9:25 AM Raihan Rasheed Apurbo <apurbo97@gmail.com> wrote:
Before experimenting, can't I ask someone whether relevant experiments have been made or not? If I am not supposed to do that, then I totally agree with you, sir.
I tried asking on python-ideas, and one of the replies was: "I gather there have been some experiments along these lines as part of efforts to remove the GIL. You might like to research them to find out how much success they've had."
What I can take from this reply is that there are some experiments on this topic, but I don't know where they are. I then asked where I could find them but didn't get an answer.
So how am I supposed to get my answers?
I am sorry if I am asking an inappropriate question here.
Raihan,

Nobody knows the answers to your questions ("can we optimize...", "where would I face problems...", "is there something I am missing?", "tell me the factors I should consider"). Your questions are somehow presented as if we *should* know the answers, and seem to insinuate that we should have investigated the techniques you mention. The answer to this is that we are doubtful that it will be easy to implement them in a fully backwards compatible way (and you'd be surprised how important backwards compatibility is).

The only thing I personally know about this topic is that Larry Hastings attempted a more sophisticated approach to multi-threaded reference counting without the GIL, and came up short despite putting a lot of effort into it. You should probably do some Googling on the Gilectomy and find Larry's various annual updates. That is presumably what that python-ideas answer was referring to.

Good luck.

--
--Guido van Rossum (python.org/~guido)
In today's CPython with the GIL, it's hard to beat the existing simple reference counting approach. It's already so cheap, and CPython's implementation has had a lot of optimization. I implemented "buffered reference counting" in the Gilectomy because it's far friendlier to multicore. It added an enormous amount of overhead, but with enough cores it would eventually become a performance win.

I'm not optimistic that coalescing would be a win for CPython. On the C side, reference count change code is written by hand, so hopefully it already does an intelligent job without redundant reference changes. On the bytecode side, reference count changes are implicit, so there's nothing you can optimize. Changing CPython bytecode to make reference count changes explicit would add so much overhead--the generated bytecode would be much larger, and you'd be dispatching so many reference count change opcodes--that it's hard to imagine it would be cheaper than just living with the redundant reference count changes.

My memory is hazy, but IIRC "deferred" reference counting is a clever scheme in which stack references don't change the reference count; only heap references do. An object has to have a reference count of zero /and/ have no active references on the stack in order to be collected. This seems like a lovely idea, maximizing the best attributes of both reference counting and tracing garbage collection. However, this would require walking the stack(s), which we can't do in cross-platform ANSI C. It would also require a complete overhaul of the C code for CPython itself (and eventually all extensions), hand-reviewing every reference count change to decide "is this a heap or a stack reference?"

I don't think there is "low-hanging fruit" left to optimize in CPython's reference counting scheme. So all that's left are major rewrites like "deferred" reference counting. Personally I think the future of CPython is to change completely over to tracing garbage collection.
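The deferred scheme described above can be modeled in a few lines of Python (a toy sketch with hypothetical names, not CPython API): heap references are counted as usual, stack references are not, and collection only frees objects whose heap count is zero *and* that are absent from a stack scan taken at a safe point:

```python
class DeferredRC:
    """Toy model of deferred reference counting; not how CPython works."""

    def __init__(self):
        self.heap_counts = {}            # obj -> count of heap references only

    def new(self, obj):
        self.heap_counts[obj] = 0        # freshly created, no heap refs yet

    def heap_incref(self, obj):
        self.heap_counts[obj] += 1

    def heap_decref(self, obj):
        self.heap_counts[obj] -= 1       # may reach zero; freed only lazily

    def collect(self, stack_roots):
        """Free objects with zero heap refs that are not on the stack either.

        In a real collector, stack_roots would come from walking the C and
        interpreter stacks, which is exactly the hard part.
        """
        live_on_stack = set(stack_roots)
        freed = [o for o, c in self.heap_counts.items()
                 if c == 0 and o not in live_on_stack]
        for o in freed:
            del self.heap_counts[o]
        return freed
```

The sketch makes the tradeoff visible: individual increfs/decrefs get cheaper (stack traffic is free), but reclamation is no longer immediate and requires a stack scan at every collection.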
It's so much friendlier to multicore, which is clearly the future of programming. I'd rather see efforts in this area directed towards that goal. //arry/
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XSP4NZDT... Code of Conduct: http://python.org/psf/codeofconduct/
On 2020-09-01, Larry Hastings wrote:
Personally I think the future of CPython is to change completely over to tracing garbage collection. It's so much friendlier to multicore, which is clearly the future of programming. I'd rather see efforts in this area directed towards that goal.
I think either CPython does that or some other implementation is going to displace it. CPython doesn't have a good way of utilizing multi-core CPUs, and the various multi-process approaches don't solve the problem of efficiently passing data between threads of execution.

An elegant approach would be to use message passing, as Erlang does. However, given that Python is not a functional language and that most core data structures are mutable, it seems a poor fit. The most obvious approach is to adopt a multi-threaded model like modern Java's, i.e. no GIL and non-thread-safe core data structures. That sounds a bit scary, but based on the Java experience it seems programmers can manage it.

If it weren't for CPython's reference counted GC, that kind of threading model seems relatively easy to implement. Getting libgc working would be a useful first step: https://discuss.python.org/t/switching-from-refcounting-to-libgc/1641

I guess it's hard to get people excited about that work because you have to go backwards in performance before you can possibly go forward. The non-reference-counted CPython is going to be much slower, and recovering that performance will be a long slog.
On 2/09/20 8:32 am, Neil Schemenauer wrote:
The most obvious approach is to adopt a multi-threaded model like modern Java's, i.e. no GIL and non-thread-safe core data structures. That sounds a bit scary, but based on the Java experience it seems programmers can manage it.
I think that depends on how non-thread-safe it is. If it's "weird things can happen" it might be all right. But if it's "can crash the interpreter" it might not be all right. -- Greg
On 2020-09-02, Greg Ewing wrote:
I think that depends on how non-thread-safe it is. If it's "weird things can happen" it might be all right. But if it's "can crash the interpreter" it might not be all right.
Weird things would include unexpected exceptions. This seems like a relevant discussion: https://softwareengineering.stackexchange.com/questions/262428/race-conditio... The Java spec seems to contain the details, but I admit I haven't studied them: https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html Getting exceptions if your locking is incorrect seems like an okay tradeoff to me.

My knowledge of Java is pretty limited, but I believe they originally tried to make the core data structures thread-safe (e.g. maps, vectors). That turned out to be too difficult or too expensive. Instead, the core collection types are not thread-safe, and they introduced new "concurrent" collections. That way, you only pay the cost of synchronization if you need it.
Thank you, Larry Hastings, for your wonderful detailed answer. It is just the answer I was looking for. I am trying to work on memory management in Python, and CPython's reference counting is very hard to modify. Before I start implementing something, I just wanted to know whether anybody had tried similar things before, and to learn from their experience, so that I don't end up implementing something that was already tried before me. Now that I know your thoughts on this, it will be easier for me to find my way and decide where to put effort. I am grateful for your answer. Thank you, sir. I apologize again if my way of asking was rude.
Thank you, sir, for directing me to the Gilectomy. I will definitely take a look at that and try to design my own experiments. I could have asked for direction in a better way; thanks for pointing that out as well.
Search for “Larry Hastings gilectomy”. For example, there’s this: https://speakerdeck.com/pycon2017/larry-hastings-the-gilectomy-hows-it-going -- Eric V. Smith (301) 502-0945 cell
Thank you, sir. Guido already mentioned it, so I was going to search Google for it. Thanks for your direct link as well; I really appreciate it. I will surely look into that. :)
On 9/2/2020 1:23 PM, Raihan Rasheed Apurbo wrote:
Thank you, sir. Guido already mentioned it, so I was going to search Google for it. Thanks for your direct link as well; I really appreciate it. I will surely look into that. :)
No problem. It would also help if you included a little context when you replied to messages. It's not clear who you're responding to here (although I think it's me). Eric
Sorry, Eric V. Smith. I don't know how to quote someone the way everyone else is doing it; I am new to Mailman. I am working on it, though. Thanks.
I'm surprised nobody has mentioned this: there are no "unboxed" types in CPython - in effect, every object user code creates is allocated from the heap. Even, e.g., integers and floats. So even non-contrived code can create garbage at a ferocious rate. For example, think about this simple function, which naively computes the Euclidean distance between two n-vectors:

```python
def dist(xs, ys):
    from math import sqrt
    if len(xs) != len(ys):
        raise ValueError("inputs must have same length")
    return sqrt(sum((x - y)**2 for x, y in zip(xs, ys)))
```

In general, `len(xs)` and `len(ys)` each create a new integer object, both of which become trash the instant `!=` completes. Say the common length is `n`. `zip` then creates `n` 2-tuple objects, each of which lives only long enough to be unpacked into `x` and `y`. The result of `x - y` lives only long enough to be squared, the result of that lives only long enough to be added into `sum()`'s internal accumulator, and finally the grand total lives only long enough to extract its square root.

With "immediate" reclamation of garbage via refcounting, memory use is trivial regardless of how large `n` is, as CPython reuses the same heap space over & over & over, ASAP. The space for each 2-tuple is reclaimed before `x - y` is computed, the space for that is reclaimed when the square completes, and the space for the square is reclaimed right after `sum()` folds it in. It's memory-efficient and cache-friendly "in the small".

Of course that's _assuming_, e.g., that `(x - y).__pow__(2)` doesn't save away its arguments somewhere that outlives the method call, but the compiler has no way to know whether it does. The only thing it can assume about the element types is that they support the methods the code invokes. In fact, the compiler has no idea whether the result type of `x - y` is even the same as the type of `x` or of `y`.
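The "immediate reclamation" point can be observed directly with a weak reference. In CPython, a refcount hitting zero frees the object deterministically, so the weakref is dead the moment the last strong reference goes away (this is CPython-specific; other implementations may delay reclamation):

```python
import weakref

class Node:
    """A trivial heap-allocated object we can weakly reference."""
    pass

def make_and_drop():
    n = Node()
    r = weakref.ref(n)
    assert r() is n   # the object is alive while the frame references it
    return r          # n's refcount hits zero as the frame exits

r = make_and_drop()
print(r())  # None in CPython: reclaimed immediately, no GC pass needed
```

No cycle collector is involved here; the object participates in no reference cycle, so plain refcounting frees it the instant the frame's local goes away.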
Tim Peters wrote:
`zip` then creates `n` 2-tuple objects, each of which lives only long enough to be unpacked into `x` and `y`... With "immediate" reclamation of garbage via refcounting, memory use is trivial regardless of how large `n` is, as CPython reuses the same heap space over & over & over, ASAP. The space for each 2-tuple is reclaimed before `x-y` is computed...
It's also worth noting that the current refcounting scheme allows for some pretty sneaky optimizations under-the-hood. In your example, `zip` only ever creates one 2-tuple, and keeps reusing it over and over: https://github.com/python/cpython/blob/c96d00e88ead8f99bb6aa1357928ac4545d92... This is thanks to the fact that most `zip` usage looks exactly like yours, where the tuple is only around long enough to be unpacked. If `zip.__next__` knows it's not referenced anywhere else anymore, it is free to mutate(!) it in place. I believe PyUnicode_Append does something similar for string concatenation, as well.
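Under the stated condition (no other reference to the previous result tuple), the reuse is easy to observe from Python. This snippet deliberately relies on the CPython implementation detail described above, so it is illustrative rather than guaranteed across versions or implementations:

```python
z = zip(range(3), "abc")
t1 = next(z)
addr = id(t1)
del t1            # drop our reference; only zip's internal one remains
t2 = next(z)
reused = (id(t2) == addr)   # True on current CPython: same tuple recycled
```

Conversely, if `t1` were still alive when `next(z)` ran, the refcount check would fail and a fresh tuple would be allocated, which is why code that stores the yielded tuples themselves never observes this aliasing.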
On Thu, 03 Sep 2020 18:17:12 -0000 "Brandt Bucher" <brandtbucher@gmail.com> wrote:
Tim Peters wrote:
`zip` then creates `n` 2-tuple objects, each of which lives only long enough to be unpacked into `x` and `y`... With "immediate" reclamation of garbage via refcounting, memory use is trival regardless of how large `n` is, as CPython reuses the same heap space over & over & over, ASAP. The space for each 2-tuple is reclaimed before `x-y` is computed...
It's also worth noting that the current refcounting scheme allows for some pretty sneaky optimizations under-the-hood. In your example, `zip` only ever creates one 2-tuple, and keeps reusing it over and over:
https://github.com/python/cpython/blob/c96d00e88ead8f99bb6aa1357928ac4545d92...
This code is a bit concerning: it means the objects yielded by zip() can be kept alive inside the zip instance until the next iteration:

    olditem = PyTuple_GET_ITEM(result, i);
    PyTuple_SET_ITEM(result, i, item);
    Py_DECREF(olditem);

(in case of error during iteration, it seems this may even be worse) And tuple allocation uses freelists for small sizes, so I'm not even sure how useful that optimization is. Regards Antoine.
04.09.20 13:40, Antoine Pitrou пише:
And tuple allocation uses freelists for small sizes, so I'm not even sure how useful that optimization is.
I do not know about zip(), but I measured the overhead of tuple allocation compared to reusing a tuple in other code, and it was not so small. Tuple *deallocation* adds a significant amount of overhead due to the use of the trashcan mechanism.
Raihan Rasheed Apurbo wrote:
In CPython we have reference counting. My question is can we optimize current RC using strategies like Deferred RC and Coalescing?
You might also look at some of the older attempts to remove reference counting entirely in favor of (usually) the Boehm tracing collector.

The "problem" is that adding 1 to an integer that is already in cache is pretty quick, and the same goes for subtracting 1. So to beat the current system, you need something very fast *and* you need to take out some cost that isn't directly part of reference counting. If you're assuming multiple threads (or processes) on different cores, that is possible. You know about the Gilectomy by now.

I suspect that splitting the reference count away from the object itself could also be profitable, as it means the cache won't have to be dirtied (and flushed) on read access, and can keep Copy-On-Write from duplicating pages. On the other hand, it also means that you'll (eventually, after your optimization strategies) have to issue extra loads for the reference count, instead of getting it into cache for free when you load the data.
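A split-off layout of that sort can be sketched as follows (toy model, hypothetical names): the counts live in a separate mapping keyed by object identity, so adjusting a count touches only the side table and never the memory holding the object itself:

```python
class SideTableRC:
    """Toy model of out-of-line reference counts kept in a side table."""

    def __init__(self):
        self.counts = {}  # id(obj) -> refcount, stored apart from the object

    def incref(self, obj):
        key = id(obj)
        self.counts[key] = self.counts.get(key, 0) + 1

    def decref(self, obj):
        """Return True when the count reaches zero (object may be freed)."""
        key = id(obj)
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.counts[key]
            return True
        return False
```

In a real C implementation, the tradeoff is the one described above: read-only access no longer dirties the object's cache line (and copy-on-write pages stay shared), at the price of an extra load to reach the count.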
If no then where would I face problems if I try to implement these sorts of strategies?
The current system is really very fast, particularly in a single-threaded environment. You probably won't see enough gains to make up for the complexity unless you do also reorganize memory. That isn't easy to do in incremental steps, or in a backwards-compatible way. But looking at how PyPy organizes its memory models may provide a rough map of something that works. (Or warn you of what doesn't work, if they say "can't use this model when using extension modules.")
These strategies all depend on the concept that we don't need the exact value of reference count all the time. So far in my observation, we only need exact value before running a cycle collector.
We also need to be sure the estimate is never too low, or at least that it never goes negative, and that it never hits 0 when it shouldn't. Being too high is fine, but it may lead to using a surprisingly large amount of extra memory, and breaking out of cache more often.
On 9/2/20 8:50 PM, Jim J. Jewett wrote:
I suspect that splitting the reference count away from the object itself could also be profitable, as it means the cache won't have to be dirtied (and flushed) on read access, and can keep Copy-On-Write from duplicating pages.
I had a patch from Thomas Wouters doing that for the Gilectomy. Last time I tried it, it was a performance wash. //arry/
participants (11)
- Antoine Pitrou
- Brandt Bucher
- Eric V. Smith
- Greg Ewing
- Guido van Rossum
- Jim J. Jewett
- Larry Hastings
- Neil Schemenauer
- Raihan Rasheed Apurbo
- Serhiy Storchaka
- Tim Peters