For quicker execution, don't refcount objects that can't be deleted

Hi

Here's something that might make code run quicker. The basic idea is to not refcount some objects that are sure never to be deleted. On a multicore machine, this might significantly reduce the number of cache invalidations (and perhaps misses). I lack many of the skills needed to investigate this further.

Aside: this idea was prompted by "Make `del x` an expression evaluating to `x`":
https://mail.python.org/archives/list/python-ideas@python.org/message/WVQNAT...

Consider:
>>> tuple(id(n) - id(0) for n in range(10))
(0, 32, 64, 96, 128, 160, 192, 224, 256, 288)
Why? Small integers are stored at fixed locations, hence the arithmetic progression. Let's look at refcounts.
>>> import sys
>>> tuple(sys.getrefcount(n) for n in range(10))
(511, 837, 113, 54, 63, 35, 30, 20, 65, 17)
These refcounts are variable numbers. They can be changed (within limits).
>>> x = [0] * 100000
>>> tuple(sys.getrefcount(n) for n in range(10))
(100510, 837, 113, 54, 63, 35, 30, 20, 65, 17)
The same happens with None.
>>> sys.getrefcount(None)
8475
>>> x = [None] * 100000
>>> sys.getrefcount(None)
108475
For me, the basic idea of the implementation would be to not refcount those objects whose id lies in a certain range. As stated earlier, I suspect the main benefit will be on multicore machines, from being able to make better use of per-core caches.

If anyone is interested, I suggest starting with None, to get a rough estimate of the possible benefits (and cost of implementation).

As well as the python-ideas thread mentioned above, related to this is:
https://stackoverflow.com/questions/14139111/python-disable-reference-counti...

--
Jonathan
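To get a feel for whether such a range exists on a given build, a rough sketch follows. It is CPython-specific and layout-dependent -- the addresses are an accident of how the interpreter happens to be loaded -- so treat the output as illustrative only.

    # Candidate never-deleted objects: a few singletons plus the small-int
    # cache.  On CPython these are allocated once at start-up, so their
    # addresses are knowable in advance; the question is how tight a range
    # they fall into compared with ordinary heap objects.
    candidates = [None, True, False, Ellipsis, *range(-5, 257)]
    ids = sorted(id(obj) for obj in candidates)
    print("lowest id: ", hex(ids[0]))
    print("highest id:", hex(ids[-1]), "  span:", ids[-1] - ids[0], "bytes")
    print("a fresh heap object, for comparison:", hex(id(object())))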

Well, the problem is that the code calling the refcount doesn't know those objects are "special", and thus needs to call Py_INCREF and Py_DECREF anyway. So are you suggesting that Py_INCREF and Py_DECREF do a check to see if the object is "special" in this way, and not actually change the refcount? Would that really help performance at all?

BTW, you could probably check this with little understanding of CPython internals by simply hacking those two functions (macros?) to be no-ops, and seeing how it affects performance.

-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
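For anyone who tries that experiment, a minimal timing harness might look like the sketch below: run it unchanged on a stock build and on the hacked build, and compare the ratio. The workload is just a guess at something dominated by refcount traffic on None; the absolute numbers mean nothing on their own.

    import timeit

    def noop():
        return None            # returning None INCREFs the None singleton

    def churn(n=1_000_000):
        x = None
        for _ in range(n):
            x = noop()         # each call INCREFs None; each rebind DECREFs it
        return x

    print("best of 5:", min(timeit.repeat(churn, number=10, repeat=5)), "seconds")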

There isn't really any contention for these memory locations in CPython as it stands, because only one interpreter thread can run at a time. The only time a cache handoff is needed is during a thread switch when the new thread is scheduled on a different core, which is pretty rare (at CPU timescales). Adding checks to every incref/decref would probably cost more time than it would save.

Something that might help performance a bit, and wouldn't hurt it, would be to drop explicit calls of Py_{INC,DEC}REF(Py_{None,False,True,...}), such as the ones in Py_RETURN_{NONE,FALSE,TRUE}, making these objects' refcounts into free-floating meaningless values. The refcounts would be initially set to a value far from zero, and on the rare occasions that they hit zero, the dealloc functions would just set them back to the initial value. Whether this would save enough time to be worth it, I don't know.

(To avoid signed wraparound undefined behavior, you'd have to either change the refcount type from ssize_t to size_t, or else keep the DECREF calls and set the initial value to something like PY_SSIZE_T_MAX/2.)
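To put a number on "far from zero", here is a back-of-the-envelope sketch. The decref rate is an assumption invented for the estimate, not a measured figure.

    # If the counter starts at PY_SSIZE_T_MAX/2 on a 64-bit build and only
    # ever gets DECREFed (the matching INCREFs having been dropped), how long
    # until the "reset" dealloc would even fire?
    PY_SSIZE_T_MAX = 2**63 - 1
    initial = PY_SSIZE_T_MAX // 2
    decrefs_per_second = 10**9          # assumed, and generous
    years = initial / decrefs_per_second / (3600 * 24 * 365)
    print(f"about {years:.0f} years of unmatched DECREFs before hitting zero")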

[Steve Barnes] Of course if we had a NaN value for integers, int('NaN'), then we could just set the initial count to it, and since NaN - anything = NaN, all would be golden.

On Mon, Jun 15, 2020 at 06:22:34PM +1200, Greg Ewing wrote:
On 15/06/20 5:11 pm, Steve Barnes wrote:
Of course if we had a NaN value for integers, int('NaN'), then we could just set the initial count to it and since NaN - anything = NaN all would be golden.
Or we could use floating-point reference counts...
Remind me again, was the aim to speed up the interpreter or slow it down? :-) -- Steven

[Greg Ewing] Or we could use floating-point reference counts...

[Steve Barnes] I thought of floating-point reference counts, but:

a) 65535.0 -= 1 is slower than 65535 -= 1 (7.6% on my system, quite a bit worse on some embedded systems).

b) There comes a point with floats (for large values of x) where x - 1 == x. That happens around 2**53, i.e. about 9 * 10**15, which for some scientific and big-data calculations is not that big -- but unless reference counting uses Python integers, rather than C, this will not be an issue.

c) Of course I like the concept of integer nan, inf, etc. 😊
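Point (b) is easy to check directly; the cut-off comes from the 53-bit mantissa of an IEEE-754 double, and is not CPython-specific.

    x = 2.0 ** 53
    print(x - 1.0 == x)          # False: 2**53 - 1 is still exactly representable
    x = 2.0 ** 54
    print(x - 1.0 == x)          # True: 2**54 - 1 rounds (ties-to-even) back to 2**54
    print(1e16 - 1.0 == 1e16)    # True: already inside "big data" territory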

On 14 Jun 2020, at 22:59, Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
There isn't really any contention for these memory locations in CPython as it stands because only one interpreter thread can run at a time. The only time a cache handoff is needed is during a thread switch when the new thread is scheduled on a different core, which is pretty rare (at CPU timescales). Adding checks to every incref/decref would probably cost more time than it would save.
The problem is when you fork a Python process. Each of the child processes, you would hope, shares the state of the parent that is not being changed. But because of ref counting, even unchanging objects get modified by a ref count inc/dec cycle, and then the page that the object is in is copy-on-write'd. The end result is that the children share no pages with the parent.

Barry
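A rough, Linux-only illustration of this effect (it assumes /proc/self/smaps_rollup is available, i.e. a reasonably recent kernel, and it is a sketch rather than a benchmark):

    import os

    # Build the data before forking, so its pages start out shared with the child.
    data = tuple(str(i) * 50 for i in range(100_000))   # a few tens of MiB

    def private_dirty_kb():
        # Total private dirty memory of the calling process (Linux only).
        with open("/proc/self/smaps_rollup") as f:
            for line in f:
                if line.startswith("Private_Dirty:"):
                    return int(line.split()[1])

    pid = os.fork()
    if pid == 0:                       # child
        before = private_dirty_kb()
        for s in data:                 # a read-only traversal of inherited data...
            pass                       # ...but every item still gets INCREF/DECREFed
        after = private_dirty_kb()
        print(f"child private dirty: {before} kB -> {after} kB")
        os._exit(0)
    os.waitpid(pid, 0)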

On Mon, Jun 15, 2020 at 9:21 AM Barry Scott <barry@barrys-emacs.org> wrote:
The problem is when you fork a python process.
Each of the child processes, you would hope, shares the state of the parent that is not being changed. But because of ref counting, even unchanging objects get modified by a ref count inc/dec cycle, and then the page that the object is in is copy-on-write'd.
The end result is that the children share no pages with the parent.
I'm out of my depth here, but: how many immortal objects are there? Quite a few, but they are small, yes? (None, False, True, small integers, ....) And copy-on-write happens at page granularity (typically 4 KiB). So would having a bunch of small immortal objects that don't get altered really help? Maybe, if they were organised to be all together. (A rough tally of the obvious candidates is sketched below.)

It seems this could make a much more substantial difference if the user could mark certain objects immortal. But that would be pretty tricky -- as Python typically has a lot of small objects in containers -- how would you mark them all?

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
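The tally below counts only the singletons and CPython's small-int cache, so interned strings and other long-lived objects would add more; it is meant only to give a sense of scale.

    import sys

    # The obvious always-alive objects: a few singletons plus the cached
    # small integers (-5 through 256).
    candidates = [None, True, False, Ellipsis, NotImplemented, (), *range(-5, 257)]
    total = sum(sys.getsizeof(obj) for obj in candidates)
    print(len(candidates), "objects,", total, "bytes in total")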

On 15 Jun 2020, at 20:48, Christopher Barker <pythonchb@gmail.com> wrote:
On Mon, Jun 15, 2020 at 9:21 AM Barry Scott <barry@barrys-emacs.org> wrote: The problem is when you fork a python process.
Each of the child processes, you would hope, shares the state of the parent that is not being changed. But because of ref counting, even unchanging objects get modified by a ref count inc/dec cycle, and then the page that the object is in is copy-on-write'd.
The end result is that the children share no pages with the parent.
I'm out of my depth here, but:
how many immortal objects are there? Quite a few, but they are small, yes? (None, False, True, small integers, ....) And copy-on-write happens at page granularity (typically 4 KiB). So would having a bunch of small immortal objects that don't get altered really help? Maybe, if they were organised to be all together.
The PR that Guido pointed to works by no longer doing INC/DEC on objects after a special call that marks all existing objects as immortal. This has the effect of allowing all the code and data that has been loaded to stay in shared pages of memory. This can be 100s of MiB of RAM in the interesting cases.
It seems this could make a much more substantial difference if the user could mark certain objects immortal. But that would be pretty tricky -- as Python typically has a lot of small objects in containers -- how would you mark them all?
Only if they are on a page of memory that does not also contain mortal objects. There was a discussion about moving the ref counts out of the objects and into separate pages, so that the objects that do not change are not copy-on-write (COW) un-shared. I think that if ref counts survive in CPython then this is a very interesting way out of the COW problem.

Barry

On 16/06/20 4:16 am, Barry Scott wrote:
even unchanging objects get modified by a ref count inc/dec cycle, and then the page that the object is in is copy-on-write'd.
The end result is that the children share no pages with the parent.
I wonder if it's worth trying to do anything about that? Like maybe keeping the refcounts all together in a memory area separate from the objects. -- Greg
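In Python terms, that side-table idea might look something like the toy model below. It is purely an illustration of the layout; the real thing would be C arrays managed by the interpreter or the allocator, not Python lists.

    import array

    objects = [str(i) for i in range(1000)]            # stand-ins for PyObjects
    refcounts = array.array('q', [1] * len(objects))   # the separate refcount area

    def incref(slot):
        # Bumping a count writes to the side table's pages,
        # never to the page holding the object itself.
        refcounts[slot] += 1

    def decref(slot):
        refcounts[slot] -= 1
        if refcounts[slot] == 0:
            objects[slot] = None                       # "deallocate"

    incref(42); decref(42)
    print(refcounts[42], objects[42] is not None)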

Many thanks to all for your useful and interesting contributions. They've helped me, and given me a lot to think about. It's clear that there are several related or overlapping issues. Here are some comments.

1. A major problem is to speed up A without slowing B. In other words, how to avoid refcounting None without slowing down other refcounts. (I'm reminded of putting a sentinel at the end of a list, to speed up searching for an item in the list.)

2. Which objects have busy refcounts? They are a potential target for optimization, even at the cost of slowing refcounting for the other objects.

3. The pull request that Guido referenced is quite small, and gives me some hope that I could code a patch that explores some aspect of the problem (such as gathering refcount stats).

4. Keeping the refcounts separate from the objects is looking more attractive. It would ease the CoW problem addressed in the pull request. It also makes multi-core reference count garbage collection easier (see this thread, which I started in July 2018: https://mail.python.org/archives/list/python-ideas@python.org/message/54AIYM... ).

5. Even without removing the GIL, if a second core could do the garbage collection reference counts, that would probably boost the performance of the first core (the one that does the useful work).

6. I think that statistics about the hot-spots would reduce the role for guessing and opinion. I think toy examples, implemented and tested, would increase our understanding. (A rough starting point is sketched below.)

7. This problem is harder than I thought. To be successful, my suggestion needs a cheap way to avoid unnecessary refcount operations.

8. If all objects are eternal, then we don't need any refcounts. But then we don't have any garbage collection either! It seems to me that much of the problem lies in the interface between eternal and transient objects. Mutable eternals are particularly problematic.

9. Relevant here is https://en.wikipedia.org/wiki/Flyweight_pattern, although the purpose is somewhat different.

10. Choosing the right problem to solve is very important. This could be subtle. I seem to have leapt at an obvious, but difficult (and perhaps insoluble), problem. (We can avoid the unnecessary refcounting, but perhaps not cheaply enough.)

I've changed my opinion. I doubt that there's an easy win here. But I suspect that there is a win that is worth having. I'll now go away and think about things for a while.

Thank you all, for the useful contributions.

--
Jonathan
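For point 6, here is a minimal sketch of how one might start gathering such statistics from a running program. Note that gc.get_objects() only reports container objects, so the well-known singletons are added by hand, and sys.getrefcount itself adds one temporary reference, so the absolute numbers are slightly inflated; the ranking is the interesting part.

    import gc, sys

    # Sample the live heap plus the well-known singletons, and rank by refcount.
    candidates = [None, True, False, (), *range(-5, 257), *gc.get_objects()]
    busiest = sorted(candidates, key=sys.getrefcount, reverse=True)[:10]
    for obj in busiest:
        print(f"{sys.getrefcount(obj):>12}  {type(obj).__name__}")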

On Tue, 16 Jun 2020 09:34:59 +0100 Jonathan Fine <jfine2358@gmail.com> wrote:
I've changed my opinion. I doubt that there's an easy win here. But I suspect that there is a win that is worth having. I'll now go away and think about things for a while.
This is a good idea. An even better idea, for people who think this is an interesting path, would be to actually experiment and then publish their observations somewhere. Right now it seems python-ideas is hosting the same discussion about immortal objects, rehashing the exact same ideas, roughly every 6 months. That's not very productive. Regards Antoine.

Did Larry Hastings ever publish his experience with separating the refcount for his Gilectomy project? I believe he told us at a sprint or language summit that he had this working, with much effort, but hadn't managed to make it faster than the existing refcounting implementation yet.
-- --Guido van Rossum (python.org/~guido) Pronouns: he/him (why is my pronoun here?) <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Guido wrote:
Did Larry Hastings ever publish his experience with separating the refcount for his Gilectomy project? I believe he told us at a sprint or language summit that he had this working, with much effort, but hadn't managed to make it faster than the existing refcounting implementation yet.
There is his talk, "The Gilectomy: How's It Going?", from PyCon 2017:
https://www.youtube.com/watch?v=pLqv11ScGsQ

Here's a list of all his talks (I was in the audience for one of them):
https://pyvideo.org/speaker/larry-hastings.html

I hope this helps. As I recall, it gives the information Guido is asking for.

--
Jonathan

There's a PR for this: "Immortal instances", by Eddie Elizondo (Facebook).

GitHub PR: https://github.com/python/cpython/pull/19474
Bug report: https://bugs.python.org/issue40255
-- --Guido van Rossum (python.org/~guido) Pronouns: he/him (why is my pronoun here?) <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
participants (9)
- Antoine Pitrou
- Barry Scott
- Ben Rudiak-Gould
- Christopher Barker
- Greg Ewing
- Guido van Rossum
- Jonathan Fine
- Steve Barnes
- Steven D'Aprano