Tagged pointer experiment: need help to optimize
Hi, I need to help to attempt to optimize my experimental CPython fork which uses tagged pointers. When I proposed my PEP 620 "Hide implementation details from the C API", I was asked about a proof that the PEP unlocks real optimization possibilities. So I wrote an implementation of tagged pointers: https://github.com/vstinner/cpython/pull/6 The main benefit is the memory usage. For example, list(range(200)) uses 1656 bytes instead of 7262 (4x less memory). Sadly, my current simple implementation is 1.1x slower than the reference. I suspect that adding a condition to Py_INCREF() and Py_DECREF() explains a large part of this overhead. My implementation uses tagged pointers for: * integers in the range: [-5; 256] * None, True and False singletons It would be nice to use tagged pointers for a wide range of integer numbers, but I wrote a simple implementation: _Py_TAGPTR_UNBOX() has to return a borrowed reference. This function should return a strong reference to support a larger range. More information in the document: https://github.com/vstinner/cpython/blob/tagged_ptr/TAGGED_POINTERS.rst Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Mon, 21 Sep 2020 20:35:33 +0200 Victor Stinner <vstinner@python.org> wrote:
When I proposed my PEP 620 "Hide implementation details from the C API", I was asked about a proof that the PEP unlocks real optimization possibilities. So I wrote an implementation of tagged pointers: https://github.com/vstinner/cpython/pull/6
The main benefit is the memory usage. For example, list(range(200)) uses 1656 bytes instead of 7262 (4x less memory).
Hmm, how come? Aren't those tiny integers singletons already? I suppose you're thinking of something like `list(range(2000, 2200))`.
Sadly, my current simple implementation is 1.1x slower than the reference. I suspect that adding a condition to Py_INCREF() and Py_DECREF() explains a large part of this overhead.
And adding a condition in every place an object is inspected. Even something as simple as Py_TYPE() is not a mere lookup anymore.
It would be nice to use tagged pointers for a wide range of integer numbers, but I wrote a simple implementation: _Py_TAGPTR_UNBOX() has to return a borrowed reference. This function should return a strong reference to support a larger range.
Hmm, it sounds a bit weird. The point of tagged pointers, normally, is to avoid creating objects at all. If you create an object dynamically each time a tagged pointer is "dereferenced", then I suspect you won't gain anything.

Regards,

Antoine.
Le lun. 21 sept. 2020 à 21:46, Antoine Pitrou <solipsis@pitrou.net> a écrit :
The main benefit is the memory usage. For example, list(range(200)) uses 1656 bytes instead of 7262 (4x less memory).
Hmm, how come? Aren't those tiny integers singletons already?
These numbers come from this code:

    import sys

    def size(l):
        return sys.getsizeof(l) + sum(sys.getsizeof(item) for item in l)

    print(size(list(range(200))), "bytes")

The code doesn't ignore the object size of the singletons: it counts their memory as if they were allocated on the heap. I took a shortcut and counted as if my implementation supported any number, not only the existing Python singletons. With my current implementation, list(range(200)) uses neither more nor less memory, but the same amount.
Sadly, my current simple implementation is 1.1x slower than the reference. I suspect that adding a condition to Py_INCREF() and Py_DECREF() explains a large part of this overhead.
And adding a condition in every place an object is inspected. Even something as simple as Py_TYPE() is not a mere lookup anymore.
You're right that Py_TYPE() also gets a condition to check if the pointer is tagged: https://github.com/vstinner/cpython/blob/tagged_ptr/Include/object.h#L203

By the way, Py_INCREF() and Py_DECREF() use:

    if (_Py_TAGPTR_IS_TAGGED(op)) {
        // Tagged pointers are immutable
        return;
    }

https://github.com/vstinner/cpython/blob/tagged_ptr/Include/object.h#L497
It would be nice to use tagged pointers for a wide range of integer numbers, but I wrote a simple implementation: _Py_TAGPTR_UNBOX() has to return a borrowed reference. This function should return a strong reference to support a larger range.
Hmm, it sounds a bit weird. The point of tagged pointers, normally, is to avoid creating objects at all. If you create an object dynamically each time a tagged pointer is "dereferenced", then I suspect you won't gain anything.
Sure, functions could be optimized to avoid the creation of temporary objects. I wrote a simple implementation which leaves the code as it is, but "unboxes" tagged pointers when a function directly accesses object members. Example in listobject.c:

    vl = (PyLongObject*)_Py_TAGPTR_UNBOX(v);
    wl = (PyLongObject*)_Py_TAGPTR_UNBOX(w);
    v0 = Py_SIZE(vl) == 0 ? 0 : (sdigit)vl->ob_digit[0];
    w0 = Py_SIZE(wl) == 0 ? 0 : (sdigit)wl->ob_digit[0];

"vl->ob_digit" crashes if vl is a tagged pointer. My first goal was to write a *correct* (working) implementation, not really an optimized one. That's why I'm calling for help to attempt to optimize it ;-)

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
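[The borrowed-reference limitation Victor describes can be illustrated with a toy model of the interned small-int table. All names below are illustrative stand-ins, not code from the fork; Obj stands in for PyLongObject.]

```c
#include <stdint.h>

typedef struct { intptr_t value; } Obj;  /* stand-in for PyLongObject */

enum { NSMALLNEG = 5, NSMALLPOS = 257 };  /* interned range [-5; 256] */
static Obj small_ints[NSMALLNEG + NSMALLPOS];
static int initialized;

static void *box(intptr_t v) {  /* valid only for [-5; 256] */
    return (void *)(((uintptr_t)v << 1) | 1);
}

/* Unboxing yields a *borrowed* reference into a static table, so no
 * INCREF/DECREF is needed -- but this only works while every
 * representable value has a preallocated object. A wider range would
 * force unbox() to allocate, i.e. to return a strong reference that
 * the caller must release. */
static Obj *unbox(void *tagged) {
    if (!initialized) {
        for (int i = 0; i < NSMALLNEG + NSMALLPOS; i++)
            small_ints[i].value = i - NSMALLNEG;
        initialized = 1;
    }
    intptr_t v = (intptr_t)(uintptr_t)tagged >> 1;  /* drop the tag bit */
    return &small_ints[v + NSMALLNEG];
}
```

Note that unboxing the same value twice returns the same table slot, which is what makes the borrowed reference safe in this narrow range.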
On 22/09/20 10:06 pm, Victor Stinner wrote:
I wrote a simple implementation which leaves the code as it is, but "unbox" tagged pointers when a function access directly object members. Example in listobject.c:
    vl = (PyLongObject*)_Py_TAGPTR_UNBOX(v);
    wl = (PyLongObject*)_Py_TAGPTR_UNBOX(w);
    v0 = Py_SIZE(vl) == 0 ? 0 : (sdigit)vl->ob_digit[0];
    w0 = Py_SIZE(wl) == 0 ? 0 : (sdigit)wl->ob_digit[0];
I think you're using the terms "box" and "unbox" the opposite way from most people. Usually a "boxed" type is one that lives on the heap, and an "unboxed" one doesn't.
My first goal was to write a *correct* (working) implementation, not really an optimized implementation.
That's why I'm calling for help to attempt to optimize it ;-)
What are you trying to achieve by using tagged pointers? It seems to me that in a dynamic environment like Python, tagged pointer tricks are only ever going to reduce memory usage, not make anything faster, and in fact can only make things slower if applied everywhere. We already have ways of efficiently storing collections of ints and floats -- array.array, numpy, etc. -- and you only pay for the overhead of those if you need them. -- Greg
On Tue, Sep 22, 2020 at 4:58 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
What are you trying to achieve by using tagged pointers?
It seems to me that in a dynamic environment like Python, tagged pointer tricks are only ever going to reduce memory usage, not make anything faster, and in fact can only make things slower if applied everywhere.
Hm... mypyc (an experimental Python-to-C compiler bundled with mypy) uses tagged pointers to encode integers up to 63 bits. I think it's done for speed, and it's probably faster in part because it avoids slow memory accesses. But (a) I don't think there's overflow checking, and (b) mypyc is very careful that tagged integers are never passed to the CPython runtime (since mypyc apps link with an unmodified CPython runtime for data types, compatibility with extensions and pure Python code). Nevertheless I think it puts your blanket claim in some perspective.

--
--Guido van Rossum (python.org/~guido)
On Wed, 23 Sep 2020 at 01:19, Guido van Rossum <guido@python.org> wrote:
On Tue, Sep 22, 2020 at 4:58 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
What are you trying to achieve by using tagged pointers?
It seems to me that in a dynamic environment like Python, tagged pointer tricks are only ever going to reduce memory usage, not make anything faster, and in fact can only make things slower if applied everywhere.
Hm... mypyc (an experimental Python-to-C compiler bundled with mypy) uses tagged pointers to encode integers up to 63 bits. I think it's done for speed, and it's probably faster in part because it avoids slow memory accesses. But (a) I don't think there's overflow checking, and (b) mypyc is very careful that tagged integers are never passed to the CPython runtime (since mypyc apps link with an unmodified CPython runtime for data types, compatibility with extensions and pure Python code). Nevertheless I think it puts your blanket claim in some perspective.
FWIW mypyc does do overflow checking, see e.g. https://github.com/python/mypy/blob/master/mypyc/lib-rt/CPy.h#L168. Also, tagged pointers did bring some speed wins, though not big ones. My expectation is that the speed wins may be even more modest without the static information that mypyc has.

// offtopic below, sorry

In general, I think any significant perf wins will require some static info given to the Python compiler. I was thinking a long while ago about defining some kind of standard protocol so that static type checkers can optionally provide some info to the Python compiler (e.g. pre-annotated ASTs and pre-annotated symbol tables), and about having a lot of specialized bytecodes. For example, if we know x is a list and y is an int, we can emit a special bytecode for x[y] that calls PyList_GetItem, and falls back to PyObject_GetItem in the rare cases where the type checker didn't infer correctly (or there was some Any involved). Another example is having special bytecodes for direct pointer access to instance attributes, etc. The main downside of such ideas is that they will take a lot of work to implement.

-- Ivan
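[A rough sketch of the kind of overflow-checked tagged arithmetic being referenced: a simplified model, not mypyc's actual code, using the GCC/Clang __builtin_add_overflow intrinsic. All names are invented for illustration.]

```c
#include <stdint.h>
#include <stdbool.h>

typedef intptr_t tagged;  /* value stored as v << 1; bit 0 reserved */

static tagged tag(intptr_t v)   { return (tagged)((uintptr_t)v << 1); }
static intptr_t untag(tagged t) { return t >> 1; }

/* Addition never needs to untag: (a<<1) + (b<<1) == (a+b)<<1.
 * One overflow check on the shifted words detects when the result no
 * longer fits in the inline range (63 bits on 64-bit platforms), at
 * which point a real implementation would fall back to a heap int. */
static bool tagged_add(tagged a, tagged b, tagged *out) {
    return !__builtin_add_overflow(a, b, out);  /* true on success */
}
```

The appeal is that the common case is a single add plus one well-predicted branch, with no memory traffic at all.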
On 23/09/2020 9:47 am, Ivan Levkivskyi wrote:
On Wed, 23 Sep 2020 at 01:19, Guido van Rossum <guido@python.org <mailto:guido@python.org>> wrote:
On Tue, Sep 22, 2020 at 4:58 PM Greg Ewing <greg.ewing@canterbury.ac.nz <mailto:greg.ewing@canterbury.ac.nz>> wrote:
What are you trying to achieve by using tagged pointers?
It seems to me that in a dynamic environment like Python, tagged pointer tricks are only ever going to reduce memory usage, not make anything faster, and in fact can only make things slower if applied everywhere.
Hm... mypyc (an experimental Python-to-C compiler bundled with mypy) uses tagged pointers to encode integers up to 63 bits. I think it's done for speed, and it's probably faster in part because it avoids slow memory accesses. But (a) I don't think there's overflow checking, and (b) mypyc is very careful that tagged integers are never passed to the CPython runtime (since mypyc apps link with an unmodified CPython runtime for data types, compatibility with extensions and pure Python code). Nevertheless I think it puts your blanket claim in some perspective.
FWIW mypyc does overflow checking, see e.g. https://github.com/python/mypy/blob/master/mypyc/lib-rt/CPy.h#L168, also tagged pointers did bring some speed wins, not big however. My expectation is that speed wins may be even more modest without static information that mypyc has.
// offtopic below, sorry
In general, I think any significant perf wins will require some static info given to the Python compiler. I was thinking a long while ago about defining some kind of a standard protocol so that static type checkers can optionally provide some info to the Python compiler (e.g. pre-annotated ASTs and pre-annotated symbol tables), and having a lot of specialized bytecodes. For example, if we know x is a list and y is an int, we can emit a special byte code for x[y] that will call PyList_GetItem, and will fall back to PyObject_GetItem in rare cases when type-checker didn't infer right (or there were some Any involved). Another example is having special byte codes for direct pointer access to instance attributes, etc. The main downside of such ideas is it will take a lot of work to implement.
Performance improvements do not need static annotations, otherwise PyPy, V8 and LuaJIT wouldn't exist. Even HotSpot was originally based on a VM for Smalltalk.

Cheers,
Mark.
On Wed, 23 Sep 2020 at 11:42, Mark Shannon <mark@hotpy.org> wrote:
On 23/09/2020 9:47 am, Ivan Levkivskyi wrote:
On Wed, 23 Sep 2020 at 01:19, Guido van Rossum <guido@python.org <mailto:guido@python.org>> wrote:
On Tue, Sep 22, 2020 at 4:58 PM Greg Ewing <greg.ewing@canterbury.ac.nz <mailto:greg.ewing@canterbury.ac.nz>> wrote:
What are you trying to achieve by using tagged pointers?
It seems to me that in a dynamic environment like Python, tagged pointer tricks are only ever going to reduce memory usage, not make anything faster, and in fact can only make things slower if applied everywhere.
Hm... mypyc (an experimental Python-to-C compiler bundled with mypy) uses tagged pointers to encode integers up to 63 bits. I think it's done for speed, and it's probably faster in part because it avoids slow memory accesses. But (a) I don't think there's overflow checking, and (b) mypyc is very careful that tagged integers are never passed to the CPython runtime (since mypyc apps link with an unmodified CPython runtime for data types, compatibility with extensions and pure Python code). Nevertheless I think it puts your blanket claim in some perspective.
FWIW mypyc does overflow checking, see e.g. https://github.com/python/mypy/blob/master/mypyc/lib-rt/CPy.h#L168, also tagged pointers did bring some speed wins, not big however. My expectation is that speed wins may be even more modest without static information that mypyc has.
// offtopic below, sorry
In general, I think any significant perf wins will require some static info given to the Python compiler. I was thinking a long while ago about defining some kind of a standard protocol so that static type checkers can optionally provide some info to the Python compiler (e.g. pre-annotated ASTs and pre-annotated symbol tables), and having a lot of specialized bytecodes. For example, if we know x is a list and y is an int, we can emit a special byte code for x[y] that will call PyList_GetItem, and will fall back to PyObject_GetItem in rare cases when type-checker didn't infer right (or there were some Any involved). Another example is having special byte codes for direct pointer access to instance attributes, etc. The main downside of such ideas is it will take a lot of work to implement.
Performance improvements do not need static annotations, otherwise PyPy, V8 and luajit wouldn't exist. Even HotSpot was originally based on a VM for SmallTalk.
Sure, but JIT optimizations assume there are some "hot spots" in the code where e.g. a function is called in a loop, so that type information can be gathered and re-used. The problem is that in my experience there are many applications where this is not the case: there are no major hot spots. For such applications JITs will not be efficient, while static annotations will work.

Another thing is that making CPython itself JITted may be even harder than adding some (opt-in) static-based optimizations, but I am clearly biased here.

Actually, what would be really cool is having both: i.e. a JIT that uses static annotations to speed up the warmup significantly. I don't know if anyone has tried something like this, but it doesn't sound impossible.

-- Ivan
Le mer. 23 sept. 2020 à 13:49, Ivan Levkivskyi <levkivskyi@gmail.com> a écrit :
Sure, but JIT optimizations assume there are some "hot spots" in the code where e.g. a function is called in a loop, so that type information can be gathered and re-used. The problem is that in my experience there are many applications where this is not the case: there are no major hot spots. For such applications JITs will not be efficient, while static annotations will work.
Another thing is that making CPython itself JITted may be even harder than adding some (opt-in) static based optimizations, but I am clearly biased here.
CPython has had a JIT compiler since Python 3.8 :-) When a code object is executed more than 1024 times, a cache is created for its LOAD_GLOBAL instructions. The What's New in Python 3.8 entry says that LOAD_GLOBAL "is about 40% faster now": https://docs.python.org/dev/whatsnew/3.8.html#optimizations

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
On 23/09/2020 12:46 pm, Ivan Levkivskyi wrote:
On Wed, 23 Sep 2020 at 11:42, Mark Shannon <mark@hotpy.org <mailto:mark@hotpy.org>> wrote:
Performance improvements do not need static annotations, otherwise PyPy, V8 and luajit wouldn't exist. Even HotSpot was originally based on a VM for SmallTalk.
Sure, but JIT optimizations assume there are some "hot spots" in the code where e.g. a function is called in a loop, so that type information can be gathered and re-used. The problem is that in my experience there are many applications where this is not the case: there are no major hot spots. For such applications JITs will not be efficient, while static annotations will work.
If an application has no hot code, then it will just terminate quickly and not need optimizing. Even a slow interpreter like CPython can execute millions of LOC per second.
Another thing is that making CPython itself JITted may be even harder than adding some (opt-in) static based optimizations, but I am clearly biased here.
Actually what would be really cool is having both: i.e. have a JIT that would use static annotations to speed-up the warmup significantly. I don't know if anyone tried something like this, but it doesn't sound impossible.
Not impossible, just not worthwhile. Cheers, Mark.
On Wed, 23 Sep 2020 12:46:25 +0100 Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
Another thing is that making CPython itself JITted may be even harder than adding some (opt-in) static based optimizations, but I am clearly biased here.
Actually what would be really cool is having both: i.e. have a JIT that would use static annotations to speed-up the warmup significantly.
Numba allows you to specify the signature of a function (i.e. the concrete argument types it accepts). That said, the goal is not to speed up any warmup phase, but to override type inference when it doesn't produce the desired results.

Regards,

Antoine.
Le mer. 23 sept. 2020 à 10:48, Ivan Levkivskyi <levkivskyi@gmail.com> a écrit :
// offtopic below, sorry
In general, I think any significant perf wins will require some static info given to the Python compiler. I was thinking a long while ago about defining some kind of a standard protocol so that static type checkers can optionally provide some info to the Python compiler (e.g. pre-annotated ASTs and pre-annotated symbol tables), and having a lot of specialized bytecodes. For example, if we know x is a list and y is an int, we can emit a special byte code for x[y] that will call PyList_GetItem, and will fall back to PyObject_GetItem in rare cases when type-checker didn't infer right (or there were some Any involved). Another example is having special byte codes for direct pointer access to instance attributes, etc. The main downside of such ideas is it will take a lot of work to implement.
Specializing code was the root idea of my old FAT Python project: https://faster-cpython.readthedocs.io/fat_python.html

I proposed https://www.python.org/dev/peps/pep-0510/ "Specialize functions with guards". Guards are mostly on function arguments, but can also be on namespaces, builtin functions, etc.

The https://github.com/vstinner/fatoptimizer project was an implementation of optimizations based on this design. See the documentation: https://github.com/vstinner/fatoptimizer/blob/master/doc/optimizations.rst

I failed to implement optimizations which showed significant speedups on real code, and the guards were not free in terms of performance. I gave up on the project.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
On Tue, 22 Sep 2020 17:11:40 -0700 Guido van Rossum <guido@python.org> wrote:
On Tue, Sep 22, 2020 at 4:58 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
What are you trying to achieve by using tagged pointers?
It seems to me that in a dynamic environment like Python, tagged pointer tricks are only ever going to reduce memory usage, not make anything faster, and in fact can only make things slower if applied everywhere.
Hm... mypyc (an experimental Python-to-C compiler bundled with mypy) uses tagged pointers to encode integers up to 63 bits. I think it's done for speed, and it's probably faster in part because it avoids slow memory accesses.
Certainly. But I suspect the statically-typed aspect of mypyc helps it avoid systematic condition checks when accessing those tagged pointers, which is not the case in a dynamically-typed setting such as Victor's PR. (However, tagged pointers plus a runtime specializer could perhaps show more benefits than the current PR.)

Regards,

Antoine.
Hi Victor,

There are plenty of reasons for a new, cleaner C-API. However, performance isn't really one of them. The C-API is not much of an obstacle to improving the performance of CPython at the moment. There are implementation details that leak into the API, but they are only an issue when we want to change those details. At which point, we should consider focused changes rather than the vague, sweeping changes suggested in PEP 620.

On 21/09/2020 7:35 pm, Victor Stinner wrote:
Hi,
I need help optimizing my experimental CPython fork which uses tagged pointers.
When I proposed my PEP 620 "Hide implementation details from the C API", I was asked about a proof that the PEP unlocks real optimization possibilities. So I wrote an implementation of tagged pointers: https://github.com/vstinner/cpython/pull/6
The main benefit is the memory usage. For example, list(range(200)) uses 1656 bytes instead of 7262 (4x less memory).
Sadly, my current simple implementation is 1.1x slower than the reference. I suspect that adding a condition to Py_INCREF() and Py_DECREF() explains a large part of this overhead.
My implementation uses tagged pointers for:

* integers in the range [-5; 256]
* the None, True and False singletons

It would be nice to use tagged pointers for a wider range of integers, but my simple implementation requires _Py_TAGPTR_UNBOX() to return a borrowed reference; supporting a larger range would require it to return a strong reference.
More information in the document: https://github.com/vstinner/cpython/blob/tagged_ptr/TAGGED_POINTERS.rst
A few suggestions:

* Make the tagged value something like:

      typedef struct { intptr_t bits; } PyTaggedValue;

  which will prevent erroneous casts to and from PyObject *.
* Add new INCREF/DECREF inline functions that take tagged values.
* Never return a borrowed reference (you should know that ;)
* Why are you tagging None, True and False? They don't take up any space.
* Abstract out the tagging scheme. You will want to change it.

Cheers,
Mark.
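[Mark's first suggestion, sketched out with hypothetical helper names: wrapping the tagged word in a one-member struct means an accidental conversion to or from a raw object pointer becomes a compile error rather than a silent bug.]

```c
#include <stdint.h>

/* A distinct struct type: unlike a bare intptr_t or void*, the compiler
 * rejects implicit conversions between this and PyObject*-style raw
 * pointers. The accessor names below are illustrative only. */
typedef struct { intptr_t bits; } PyTaggedValue;

static PyTaggedValue tagged_from_small_int(intptr_t v) {
    PyTaggedValue t = { (intptr_t)(((uintptr_t)v << 1) | 1) };
    return t;
}

static int tagged_is_int(PyTaggedValue t) {
    return (t.bits & 1) != 0;
}

static intptr_t tagged_as_small_int(PyTaggedValue t) {
    return t.bits >> 1;  /* arithmetic shift drops the tag bit */
}
```

With a bare integer representation, something like `PyObject *op = (PyObject *)t;` would compile silently; with the struct it is a hard error, which is exactly the point of the suggestion.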
participants (6)
- Antoine Pitrou
- Greg Ewing
- Guido van Rossum
- Ivan Levkivskyi
- Mark Shannon
- Victor Stinner