[capi-sig] tagged pointer fixed integer experiment
This is an experiment to exercise the new C-API we are trying to design. I would like it to be able to support tagged pointers, which requires PyObject* to be treated as an opaque pointer. Note that I'm not suggesting we should use tagged pointers in CPython. Instead, I just want to see if the API will make it possible to do so. The logic is that some Python implementations might want to use tagged pointers, and they would also like to provide the same C-API as CPython. So, it would be really nice if we don't make things really hard for them.

The patch adding the actual tagged pointer type is quite trivial. There are a number of git commits preceding it in order to make it trivial. Coccinelle has been useful. Some of my semantic patches:
@@
expression E;
@@
-E->ob_type
+Py_TYPE(E)

@@
expression E, F;
@@
-Py_TYPE(E) = F
+Py_SET_TYPE(E, F)
Run them like:

spatch --sp-file ob_type.cocci <C source files>

To speed things up, I used grep to narrow down the set of files to process. Here is the source code for the fixed-int CPython. There is a new builtin 'fixedint'. It doesn't do much yet but at least it doesn't immediately crash. https://github.com/nascheme/cpython/tree/tagged_int
I was curious about how much slower CPython would be now that I'm using functions for Py_TYPE(), Py_INCREF(), etc. Some quick and dirty benchmarking seems to show it is not so bad, maybe about 10% slower. Note that I updated the code to use C99 inline functions. They are neat.
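For reference, a minimal sketch of what the macro-to-function change looks like (illustrative only; the exact definitions in the branch may differ, and the tagged build adds a tag check on top of this):

static inline PyTypeObject *
Py_TYPE(PyObject *ob)
{
    return ob->ob_type;
}

static inline void
Py_INCREF(PyObject *op)
{
    op->ob_refcnt++;
}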
After doing the quick and dirty benchmark on the costs, I got curious about what might be the gain of using tagged pointers for small ints. The natural thing to do is to provide a fast-path in the ceval loop (ignoring Victor's warnings). BTW, the discussion in
https://bugs.python.org/issue21955
is quite interesting if you enjoy an epic saga of micro-optimization. I tried implementing a fast-path just for BINARY_ADD of two fixedint numbers.
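The fast-path has roughly the following shape (a simplified sketch using ceval's stack macros; _Py_IsTaggedPtr, _Py_UNTAG, _Py_TAG and _Py_TAG_FITS are illustrative names rather than the ones in the branch, and the real code is more careful about overflow and errors):

case TARGET(BINARY_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    if (_Py_IsTaggedPtr(left) && _Py_IsTaggedPtr(right)) {
        /* Two tagged fixed ints: add the untagged values directly.
           No heap allocation and no refcount updates are needed. */
        intptr_t result = _Py_UNTAG(left) + _Py_UNTAG(right);
        if (_Py_TAG_FITS(result)) {
            SET_TOP(_Py_TAG(result));
            DISPATCH();
        }
        /* On overflow, fall through to the generic path. */
    }
    sum = PyNumber_Add(left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(sum);
    if (sum == NULL)
        goto error;
    DISPATCH();
}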
The result looks promising:
./python -m perf timeit --name='x+y' -s 'x=10000; y=2' 'x+y' --dup 1000 -v -o int.json
./python -m perf timeit --name='x+y' -s 'x=fixedint(10000); y=fixedint(2)' 'x+y' --dup 1000 -v -o fixedint.json
./python -m perf compare_to int.json fixedint.json

Mean +- std dev: [int] 32.3 ns +- 1.0 ns -> [fixedint] 10.8 ns +- 0.3 ns: 3.00x faster (-67%)
Maybe this could turn into some kind of carrot to encourage adoption of the new C-API.
On Wed, Sep 12, 2018 at 09:17, Neil Schemenauer <nas-python@arctrix.com> wrote:
Mean +- std dev: [int] 32.3 ns +- 1.0 ns -> [fixedint] 10.8 ns +- 0.3 ns: 3.00x faster (-67%)
Maybe this could turn into some kind of carrot to encourage adoption of the new C-API.
Amazing! Now I want your code in my http://github.com/pythoncapi/cpython/ fork! Please send me a PR!
Victor
On 2018-09-12, Victor Stinner wrote:
Amazing! Now I want your code in my http://github.com/pythoncapi/cpython/ fork! Please send me a PR!
I tried yesterday to port my changes to your branch. They don't work because of the hiding of Py_TYPE behind the limited API macro. So, porting to your branch is non-trivial. It will take me some time. I would like to do more benchmarking as people are already drawing conclusions based on my bad benchmarks.
The changes to the C-API needed to make this work are relatively simple. Using Py_TYPE is the big one. Source code fixes can be automated (with Coccinelle or some other tool). So, I think this doesn't have to wait for the full C-API redesign to mature. Fixing extensions that use borrowed references is a lot harder.
My dream now is to add a 'configure' option that would turn on tagged pointers for small ints. If you turn that on, extensions accessing PyObject struct members would break. However, it would be a nicely modular option: turn it off and there should be zero performance hit. Turn it on and, from an outside perspective, Python still works as normal. So, type(<fixedint>) would return 'int'. The fixedint stuff would be a hidden implementation detail.
Regards,
Neil
On Wed, Sep 12, 2018 at 18:12, Neil Schemenauer <nas-python@arctrix.com> wrote:
On 2018-09-12, Victor Stinner wrote:
Amazing! Now I want your code in my http://github.com/pythoncapi/cpython/ fork! Please send me a PR!
I tried yesterday to port my changes to your branch. They don't work because of the hiding of Py_TYPE behind the limited API macro.
You could add a new #ifdef for tagged pointers, one which would enable Py_TYPE again.
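Something like the following, perhaps (the guard name Py_TAGGED_POINTERS is only illustrative):

#ifdef Py_TAGGED_POINTERS
/* Tagged builds must go through a function; PyObject is opaque. */
PyAPI_FUNC(PyTypeObject *) Py_TYPE(PyObject *ob);
#else
#define Py_TYPE(ob) (((PyObject *)(ob))->ob_type)
#endif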
So, porting to your branch is non-trivial. It will take me some time. I would like to do more benchmarking as people are already drawing conclusions based on my bad benchmarks.
It's too early to measure the overhead, since we haven't yet agreed on the API itself. I suggest focusing first on correctness and the design.
My dream now is to add a 'configure' option that would turn on tagged pointers for small ints. If you turn that on, extensions accessing PyObject struct members would break. However, it would be a nicely modular option: turn it off and there should be zero performance hit. Turn it on and, from an outside perspective, Python still works as normal. So, type(<fixedint>) would return 'int'. The fixedint stuff would be a hidden implementation detail.
In my plan, the "regular Python runtime" will continue to use borrowed references, Py_TYPE() and everything: https://pythoncapi.readthedocs.io/runtimes.html#regular-python-usr-bin-pytho...
But to use tagged pointers, you have to use the "New experimental runtime" where borrowed pointers, Py_TYPE(), etc. would be illegal: https://pythoncapi.readthedocs.io/runtimes.html#new-experimental-runtime-pyt...
Right now, I'm working on the implementation of this "new experimental runtime" to analyze which APIs are "bad" and how far we have to go to reach the perfect API.
Once I know the full range of required changes and we have proper benchmark results, we can *start* talking about which changes are worth it or not.
We can also imagine having multiple iterations to enhance the C API. For example, keep Py_TYPE() in the first iteration. But discuss removing it later.
About correctness: yesterday, we discussed adding a new first "ctx" parameter to all functions of the C API to support Eric Snow's sub-interpreters. The idea is to avoid any kind of global state and instead pass an opaque pointer carrying the state of an interpreter. The state would contain, for example, the memory allocators. At this point, I'm not fully excited by this change, since every C function using the C API would have to pass ctx everywhere.
Victor
On 2018-09-12, Victor Stinner wrote:
About correctness: yesterday, we discussed adding a new first "ctx" parameter to all functions of the C API to support Eric Snow's sub-interpreters. [..] At this point, I'm not fully excited by this change, since every C function using the C API would have to pass ctx everywhere.
As I said to you earlier in the week, you are a brave person to try to tackle this C-API project. Handling the multi-interpreter problem makes it harder yet and more invasive. Personally, I think the correct way to do it is to pass ctx as the first parameter to all C-API functions. If it is not a global, I think you have to pass it. Other schemes are suboptimal and we should do the correct thing.
You can have two sets of APIs. One set has ctx as the first argument of every API call. You make a second "easy" set that builds on the first set and uses a global as the ctx argument. This second set could be compatible with the current API. The painful part is that all CPython internals must use the first set of APIs (explicitly pass ctx). Something like Coccinelle could do the hard work.
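As a concrete sketch of the two levels (PyCtx, the _Ctx suffix and _PyCtx_GetDefault are all made-up names for illustration):

typedef struct _PyCtx PyCtx;        /* opaque interpreter state */

/* Level one: explicit context, used by CPython internals. */
PyObject *PyDict_New_Ctx(PyCtx *ctx);
PyCtx *_PyCtx_GetDefault(void);

/* Level two: the "easy" set, source-compatible with today's API,
   built on level one via a default global context. */
static inline PyObject *
PyDict_New(void)
{
    return PyDict_New_Ctx(_PyCtx_GetDefault());
}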
After you do it, merging between CPython versions becomes essentially impossible since you are changing something like 50% of the lines of source code. So, I can imagine that some core developers would resist the change. I think even though it is painful, it is the correct thing to do. If you want to allow Python to be embedded properly (e.g. for game scripting), you have to do it. So, I think we should plan to "bite the bullet" and have a "flag day". I think Python is losing to Lua in these applications because embedding Python doesn't work properly.
Regards,
Neil
On Sep 12, 2018, at 10:39, Neil Schemenauer <nas-python@arctrix.com> wrote:
As I said to you earlier in the week, you are a brave person to try to tackle this C-API project.
Agreed, and as you allude to, the social aspects are at least as challenging as the technical ones.
You can have two sets of APIs. One set has ctx as the first argument of every API call. You make a second "easy" set that builds on the first set and uses a global as the ctx argument. This second set could be compatible with the current API. The painful part is that all CPython internals must use the first set of APIs (explicitly pass ctx). Something like Coccinelle could do the hard work.
I agree with this general approach as well.
After you do it, merging between CPython versions becomes essentially impossible since you are changing something like 50% of the lines of source code. So, I can imagine that some core developers would resist the change. I think even though it is painful, it is the correct thing to do. If you want to allow Python to be embedded properly (e.g. for game scripting), you have to do it. So, I think we should plan to "bite the bullet" and have a "flag day". I think Python is losing to Lua in these applications because embedding Python doesn't work properly.
+1 from me as well to this general plan. I’m hoping that Victor’s port can prove that it’s feasible and give us high confidence that it can eventually lead to improved performance, embeddability, etc. I wonder whether this will be ready for 3.8, and if not, if there is some groundwork we can lay in 3.8 that won’t be as radical, but will make the flag day easier for 3.9 or 4.0 or whatever comes after that.
-Barry
On Wed, 12 Sep 2018 at 11:02 Barry Warsaw <barry@python.org> wrote:
On Sep 12, 2018, at 10:39, Neil Schemenauer <nas-python@arctrix.com> wrote:
As I said to you earlier in the week, you are a brave person to try to tackle this C-API project.
Agreed, and as you allude to, the social aspects are at least as challenging as the technical ones.
You can have two sets of APIs. One set has ctx as the first argument of every API call. You make a second "easy" set that builds on the first set and uses a global as the ctx argument. This second set could be compatible with the current API. The painful part is that all CPython internals must use the first set of APIs (explicitly pass ctx). Something like Coccinelle could do the hard work.
I agree with this general approach as well.
After you do it, merging between CPython versions becomes essentially impossible since you are changing something like 50% of the lines of source code. So, I can imagine that some core developers would resist the change. I think even though it is painful, it is the correct thing to do. If you want to allow Python to be embedded properly (e.g. for game scripting), you have to do it. So, I think we should plan to "bite the bullet" and have a "flag day". I think Python is losing to Lua in these applications because embedding Python doesn't work properly.
+1 from me as well to this general plan. I’m hoping that Victor’s port can prove that it’s feasible and give us high confidence that it can eventually lead to improved performance, embeddability, etc. I wonder whether this will be ready for 3.8, and if not, if there is some groundwork we can lay in 3.8 that won’t be as radical, but will make the flag day easier for 3.9 or 4.0 or whatever comes after that.
I think we should just operate under the assumption that this project is going to be big enough that this is a Py4K project. There's no need to rush and I wouldn't want any release deadline to make people feel rushed.
On 2018-09-12, Brett Cannon wrote:
I think we should just operate under the assumption that this project is going to be big enough that this is a Py4K project. There's no need to rush and I wouldn't want any release deadline to make people feel rushed.
I fear that scope creep is going to kill the initiative. So, I would like to see the API changes related to making alternative implementations work better (PyObject* as opaque pointer, no borrowed refs) become a separate project.
The new API would be the layer-two API as discussed above, without the ctx argument. We would take care to make sure that the new API would be compatible with a later project to introduce the multi-interpreter-safe API. Then we don't have to scare people with the prospect of changing 200k lines of CPython code to add the ctx argument.
On 2018-09-12, Neil Schemenauer wrote:
Personally, I think the correct way to do it is to pass ctx as the first parameter to all C-API functions. If it is not a global, I think you have to pass it. Other schemes are suboptimal and we should do the correct thing.
After talking to Dino a bit, I'm not sure about this now. It sounds like you could use thread-local storage rather than globals. That would be somewhat more limited but would still, in theory, allow what I was hoping for. The bigger issue seems to be allowing a per-interpreter GIL. Eric Snow is working on this with the multi-core (no GIL-sharing) project and it sounds like he thinks API changes should not be needed.
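Roughly, the thread-local version would look something like this (a sketch; all the names are illustrative):

typedef struct _PyCtx PyCtx;               /* opaque interpreter state */

/* One context per OS thread instead of a process-wide global. */
static _Thread_local PyCtx *current_ctx;   /* C11; GCC also has __thread */

PyCtx *
PyCtx_Get(void)
{
    return current_ctx;
}

void
PyCtx_Set(PyCtx *ctx)
{
    current_ctx = ctx;
}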
If my understanding is true, that's great news. The prospect of having to pass ctx everywhere inside CPython internals is quite frightening.
Hi,
While I clearly see an advantage of passing ctx explicitly to all functions of the C API, I also see that as a burden. I understood that it's not only calls to the C API that would have to be modified, but basically all C functions of CPython and all C functions of all C extensions, since ctx always comes from the parent. You would have to propagate ctx to every single C function which indirectly calls a function of the C API.
It's not easy to justify this burden for an application which uses a single thread and a single interpreter.
So yeah, if it were possible to make ctx implicit... that would be better :-)
Note: I would be curious to see allocator functions like PyMem_Malloc() and the Py_DecodeLocale()/Py_EncodeLocale() functions become "ctx-aware", since it's complex to use these functions properly during Python initialization.
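For instance, hypothetically (these _Ctx variants do not exist; PyCtx is the assumed opaque state from above):

typedef struct _PyCtx PyCtx;

/* ctx carries the allocators, so these could work before the
   runtime is fully initialized. */
void *PyMem_Malloc_Ctx(PyCtx *ctx, size_t size);
wchar_t *Py_DecodeLocale_Ctx(PyCtx *ctx, const char *arg, size_t *size);
char *Py_EncodeLocale_Ctx(PyCtx *ctx, const wchar_t *text, size_t *error_pos);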
Victor
On Thu, Sep 13, 2018 at 02:22, Neil Schemenauer <nas-python@arctrix.com> wrote:
On 2018-09-12, Neil Schemenauer wrote:
Personally, I think the correct way to do it is to pass ctx as the first parameter to all C-API functions. If it is not a global, I think you have to pass it. Other schemes are suboptimal and we should do the correct thing.
After talking to Dino a bit, I'm not sure about this now. It sounds like you could use thread-local storage rather than globals. That would be somewhat more limited but would still, in theory, allow what I was hoping for. The bigger issue seems to be allowing a per-interpreter GIL. Eric Snow is working on this with the multi-core (no GIL-sharing) project and it sounds like he thinks API changes should not be needed.
If my understanding is true, that's great news. The prospect of having to pass ctx everywhere inside CPython internals is quite frightening.
On Thu, 13 Sep 2018 at 16:40, Victor Stinner <vstinner@redhat.com> wrote:
Hi,
While I clearly see an advantage of passing ctx explicitly to all functions of the C API, I also see that as a burden. I understood that it's not only calls to the C API that would have to be modified, but basically all C functions of CPython and all C functions of all C extensions, since ctx always comes from the parent. You would have to propagate ctx to every single C function which indirectly calls a function of the C API.
Plus you'd somehow have to pass it through any invoked Python code that subsequently calls back into the C API.
It's not easy to justify this burden for an application which uses a single thread and a single interpreter.
Even without that concern, I don't see a way to handle the C -> Python -> C case without some form of implicit thread-local state management.
So yeah, if it were possible to make ctx implicit... that would be better :-)
PEP 406 was a more constrained proposal to move just the import system state to an explicit "Import Engine" object, and even in that more limited case we quickly abandoned the idea of requiring that the engine always be passed to import hooks explicitly - it simply required too many changes to already standardised PEP 302 interfaces.
Fortunately, PEP 567 gives us some solid C API friendly context management primitives in the general case, and Eric's work on the subinterpreters API should hopefully finally get us to a point where the EnsureGIL APIs and subinterpreters actually play nice together.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 13/09/2018 02:21, Neil Schemenauer wrote:
On 2018-09-12, Neil Schemenauer wrote:
Personally, I think the correct way to do it is to pass ctx as the first parameter to all C-API functions. If it is not a global, I think you have to pass it. Other schemes are suboptimal and we should do the correct thing.
After talking to Dino a bit, I'm not sure about this now. It sounds like you could use thread-local storage rather than globals. That would be somewhat more limited but would still, in theory, allow what I was hoping for.
We already use thread-local storage for the PyGILState APIs. Nothing new here ;-)
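For reference, that existing pattern for C code which doesn't know whether the current thread holds the GIL:

PyGILState_STATE gstate = PyGILState_Ensure();
/* ... it is now safe to call the Python C API ... */
PyGILState_Release(gstate);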
Regards
Antoine.
On Wed, Sep 12, 2018 at 19:39, Neil Schemenauer <nas-python@arctrix.com> wrote:
After you do it, merging between CPython versions becomes essentially impossible since you are changing something like 50% of the lines of source code. So, I can imagine that some core developers would resist the change. I think even though it is painful, it is the correct thing to do. If you want to allow Python to be embedded properly (e.g. for game scripting), you have to do it. So, I think we should plan to "bite the bullet" and have a "flag day". I think Python is losing to Lua in these applications because embedding Python doesn't work properly.
It's really too early to discuss modifying the upstream code.
But you are asking, so let me reply :-) If we find a consensus on the right level of breaking changes, if we agree on performance, if we agree on the way to opt in to the API, if we have proper tooling to migrate to the new C API, and if we provide tooling to use the new C API on Python 3.7 and older, then we can start discussing how to merge these changes into CPython upstream.
My long term plan is to make most changes conditional using a C define, and only enable this flag using a ./configure option. We would provide two runtimes: "python3.8" compiled with default options and "python3.8-exp" (or another more funky name) which enables the new cool optimizations but only supports the new C API.
"python3.8" will continue to support the full unmodified old C API, in addition to supporting the new C API.
Victor
On 12/09/2018 09:16, Neil Schemenauer wrote:
I was curious about how much slower CPython would be now that I'm using functions for Py_TYPE(), Py_INCREF(), etc. Some quick and dirty benchmarking seems to show it is not so bad, maybe about 10% slower. Note that I updated the code to use C99 inline functions. They are neat.
After doing the quick and dirty benchmark on the costs, I got curious about what might be the gain of using tagged pointers for small ints. The natural thing to do is to provide a fast-path in the ceval loop (ignoring Victor's warnings). BTW, the discussion in
https://bugs.python.org/issue21955
is quite interesting if you enjoy an epic saga of micro-optimization. I tried implementing a fast-path just for BINARY_ADD of two fixedint numbers.
The result looks promising:
./python -m perf timeit --name='x+y' -s 'x=10000; y=2' 'x+y' --dup 1000 -v -o int.json
./python -m perf timeit --name='x+y' -s 'x=fixedint(10000); y=fixedint(2)' 'x+y' --dup 1000 -v -o fixedint.json
./python -m perf compare_to int.json fixedint.json

Mean +- std dev: [int] 32.3 ns +- 1.0 ns -> [fixedint] 10.8 ns +- 0.3 ns: 3.00x faster (-67%)
Hmm... so you get a 10% global slowdown, plus a 3x speedup on a silly microbenchmark, and you call that promising? ;-)
Regards
Antoine.
On 12 Sep 2018, at 09:16, Neil Schemenauer <nas-python@arctrix.com> wrote:
I was curious about how much slower CPython would be now that I'm using functions for Py_TYPE(), Py_INCREF(), etc. Some quick and dirty benchmarking seems to show it is not so bad, maybe about 10% slower. Note that I updated the code to use C99 inline functions. They are neat.
A 10% slowdown is pretty bad, and worse than I expected. Have you tested the performance of your branch without tagged pointers but with inline functions?
Ronald
On 2018-09-12, Ronald Oussoren wrote:
A 10% slowdown is pretty bad, and worse than I expected. Have you tested the performance of your branch without tagged pointers but with inline functions?
Don't take the 10% too seriously. Quick and dirty means what it says. Today I will try to use pyperformance to get proper performance numbers. My 10% was based on just making Py_TYPE(), Py_INCREF(), Py_DECREF(), etc. into functions, with no fixedint optimization.
Neil
On 12.09.2018 09:16, Neil Schemenauer wrote:
I was curious about how much slower CPython would be now that I'm using functions for Py_TYPE(), Py_INCREF(), etc. Some quick and dirty benchmarking seems to show it is not so bad, maybe about 10% slower.
Well, you'd expect a visible performance hit here. It's a very common thing to quickly check for a couple of likely types via "ob_type" and then do something type-specific. If that "quick check" now takes serious effort, then it turns all sorts of prior speedups into expensive operations.
So, if the goal is to make this common operation slow, what alternative do you envision for getting back an "almost no cost" type check?
Stefan
On 2018-09-12, Stefan Behnel wrote:
So, if the goal is to make this common operation slow, what alternative do you envision for getting back an "almost no cost" type check?
Off the top of my head, I think you can have a C99 inline function, like Py_IsType(ob, tp). If you build with tagged pointers turned off, there is no cost compared to what we currently do. If you turn them on, there is an extra check on the low bits of the pointer and a branch. Not sure that qualifies as "almost no cost" but seems not bad to me.
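A sketch of that helper (the Py_TAGGED_POINTERS guard is illustrative; IS_TAGGED matches the macro shown later in this thread):

static inline int
Py_IsType(PyObject *ob, PyTypeObject *tp)
{
#ifdef Py_TAGGED_POINTERS
    if (IS_TAGGED(ob)) {
        /* A tagged pointer is always a fixed int. */
        return tp == &PyLong_Type;
    }
#endif
    return Py_TYPE(ob) == tp;
}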
Neil
The worst offender is probably not Py_TYPE() but reference counting.
Just like switching to atomic reference counting slows down Python by 20% to 40% (according to various attempts by various people), it's not difficult to imagine that adding a conditional branch in the critical paths of Py_INCREF() and Py_DECREF() would significantly slow down Python as well.
So, if you want tagged pointers without slowing everything other than small integer arithmetic, you probably need to ditch reference counting as well. And then, perhaps you should start by ditching reference counting, because that's much more ambitious and complicated than implementing tagged pointers ;-)
Regards
Antoine.
On 12/09/2018 23:25, Neil Schemenauer wrote:
On 2018-09-12, Stefan Behnel wrote:
So, if the goal is to make this common operation slow, what alternative do you envision for getting back an "almost no cost" type check?
Off the top of my head, I think you can have a C99 inline function, like Py_IsType(ob, tp). If you build with tagged pointers turned off, there is no cost compared to what we currently do. If you turn them on, there is an extra check on the low bits of the pointer and a branch. Not sure that qualifies as "almost no cost" but seems not bad to me.
Neil
On 2018-09-13, Antoine Pitrou wrote:
[...] it's not difficult to imagine that adding a conditional branch in the critical paths of Py_INCREF() and Py_DECREF() would significantly slow down Python as well.
I finished the benchmarking last night. Hopefully I didn't mess it up as it takes a long time. Making Py_TYPE(), Py_INCREF(), Py_DECREF() into inline functions and adding a conditional branch to check for a tag costs roughly 8%. See below. That's worse than I hoped but not as bad as I feared.
I still hope that actually using tagged fixed ints could recover that 8%. Anything not in the small int cache is doing heap allocation and that must be pretty expensive. Obviously real applications would have to get faster, not just int heavy micro-benchmarks.
So, if you want tagged pointers without slowing everything other than small integer arithmetic, you probably need to ditch reference counting as well. And then, perhaps you should start by ditching reference counting, because that's much more ambitious and complicated than implementing tagged pointers ;-)
I think no one has been hoping to get rid of reference counting in Python for longer than me. My original cycle GC patch started as experiments with mark-and-sweep collection. About 20 years have gone by but maybe we will get there yet.
BTW, the logic for Py_INCREF was:
#define IS_TAGGED(op) ((uintptr_t)(op) & 1)

static inline void
_Py_INCREF(PyObject *op)
{
    if (!IS_TAGGED(op)) {
        op->ob_refcnt++;
    }
}
pyperformance results follow.
2to3: Mean +- std dev: [base] 307 ms +- 5 ms -> [funcs] 320 ms +- 2 ms: 1.04x slower (+4%)
chameleon: Mean +- std dev: [base] 9.48 ms +- 0.15 ms -> [funcs] 10.2 ms +- 0.6 ms: 1.08x slower (+8%)
chaos: Mean +- std dev: [base] 108 ms +- 1 ms -> [funcs] 119 ms +- 1 ms: 1.09x slower (+9%)
crypto_pyaes: Mean +- std dev: [base] 112 ms +- 1 ms -> [funcs] 121 ms +- 4 ms: 1.08x slower (+8%)
deltablue: Mean +- std dev: [base] 7.17 ms +- 0.22 ms -> [funcs] 7.78 ms +- 0.51 ms: 1.08x slower (+8%)
django_template: Mean +- std dev: [base] 122 ms +- 3 ms -> [funcs] 130 ms +- 5 ms: 1.07x slower (+7%)
dulwich_log: Mean +- std dev: [base] 76.8 ms +- 0.8 ms -> [funcs] 78.5 ms +- 1.0 ms: 1.02x slower (+2%)
fannkuch: Mean +- std dev: [base] 460 ms +- 7 ms -> [funcs] 501 ms +- 2 ms: 1.09x slower (+9%)
float: Mean +- std dev: [base] 111 ms +- 2 ms -> [funcs] 121 ms +- 1 ms: 1.09x slower (+9%)
genshi_text: Mean +- std dev: [base] 29.3 ms +- 0.5 ms -> [funcs] 30.5 ms +- 1.3 ms: 1.04x slower (+4%)
genshi_xml: Mean +- std dev: [base] 62.7 ms +- 0.9 ms -> [funcs] 66.7 ms +- 2.4 ms: 1.06x slower (+6%)
go: Mean +- std dev: [base] 247 ms +- 3 ms -> [funcs] 265 ms +- 3 ms: 1.07x slower (+7%)
hexiom: Mean +- std dev: [base] 9.93 ms +- 0.58 ms -> [funcs] 10.8 ms +- 0.1 ms: 1.08x slower (+8%)
html5lib: Mean +- std dev: [base] 93.3 ms +- 3.2 ms -> [funcs] 97.9 ms +- 3.1 ms: 1.05x slower (+5%)
json_dumps: Mean +- std dev: [base] 11.7 ms +- 0.2 ms -> [funcs] 12.4 ms +- 0.4 ms: 1.05x slower (+5%)
json_loads: Mean +- std dev: [base] 25.4 us +- 1.4 us -> [funcs] 26.7 us +- 0.5 us: 1.05x slower (+5%)
logging_format: Mean +- std dev: [base] 10.1 us +- 0.6 us -> [funcs] 10.6 us +- 0.2 us: 1.05x slower (+5%)
logging_silent: Mean +- std dev: [base] 201 ns +- 13 ns -> [funcs] 215 ns +- 6 ns: 1.07x slower (+7%)
logging_simple: Mean +- std dev: [base] 9.03 us +- 0.23 us -> [funcs] 9.60 us +- 0.27 us: 1.06x slower (+6%)
mako: Mean +- std dev: [base] 17.2 ms +- 0.4 ms -> [funcs] 18.1 ms +- 0.5 ms: 1.05x slower (+5%)
meteor_contest: Mean +- std dev: [base] 100 ms +- 2 ms -> [funcs] 104 ms +- 2 ms: 1.04x slower (+4%)
nbody: Mean +- std dev: [base] 119 ms +- 4 ms -> [funcs] 134 ms +- 6 ms: 1.12x slower (+12%)
nqueens: Mean +- std dev: [base] 94.3 ms +- 1.1 ms -> [funcs] 102 ms +- 2 ms: 1.08x slower (+8%)
pathlib: Mean +- std dev: [base] 19.7 ms +- 0.2 ms -> [funcs] 20.4 ms +- 0.2 ms: 1.04x slower (+4%)
pickle: Mean +- std dev: [base] 9.08 us +- 0.25 us -> [funcs] 9.25 us +- 0.27 us: 1.02x slower (+2%)
pickle_dict: Mean +- std dev: [base] 22.5 us +- 0.2 us -> [funcs] 20.8 us +- 1.0 us: 1.08x faster (-8%)
pickle_pure_python: Mean +- std dev: [base] 466 us +- 7 us -> [funcs] 501 us +- 26 us: 1.07x slower (+7%)
pidigits: Mean +- std dev: [base] 165 ms +- 1 ms -> [funcs] 170 ms +- 4 ms: 1.03x slower (+3%)
python_startup: Mean +- std dev: [base] 7.40 ms +- 0.11 ms -> [funcs] 7.48 ms +- 0.06 ms: 1.01x slower (+1%)
python_startup_no_site: Mean +- std dev: [base] 5.10 ms +- 0.03 ms -> [funcs] 5.18 ms +- 0.05 ms: 1.02x slower (+2%)
raytrace: Mean +- std dev: [base] 487 ms +- 7 ms -> [funcs] 532 ms +- 11 ms: 1.09x slower (+9%)
regex_compile: Mean +- std dev: [base] 182 ms +- 5 ms -> [funcs] 196 ms +- 6 ms: 1.08x slower (+8%)
regex_dna: Mean +- std dev: [base] 154 ms +- 2 ms -> [funcs] 157 ms +- 0 ms: 1.02x slower (+2%)
regex_v8: Mean +- std dev: [base] 22.1 ms +- 0.6 ms -> [funcs] 22.5 ms +- 0.4 ms: 1.02x slower (+2%)
richards: Mean +- std dev: [base] 71.6 ms +- 2.6 ms -> [funcs] 76.7 ms +- 1.9 ms: 1.07x slower (+7%)
scimark_fft: Mean +- std dev: [base] 316 ms +- 6 ms -> [funcs] 351 ms +- 2 ms: 1.11x slower (+11%)
scimark_lu: Mean +- std dev: [base] 172 ms +- 6 ms -> [funcs] 195 ms +- 7 ms: 1.13x slower (+13%)
scimark_monte_carlo: Mean +- std dev: [base] 103 ms +- 3 ms -> [funcs] 116 ms +- 5 ms: 1.13x slower (+13%)
scimark_sor: Mean +- std dev: [base] 188 ms +- 6 ms -> [funcs] 209 ms +- 6 ms: 1.11x slower (+11%)
scimark_sparse_mat_mult: Mean +- std dev: [base] 3.68 ms +- 0.13 ms -> [funcs] 4.23 ms +- 0.12 ms: 1.15x slower (+15%)
spectral_norm: Mean +- std dev: [base] 123 ms +- 5 ms -> [funcs] 143 ms +- 2 ms: 1.16x slower (+16%)
sqlalchemy_declarative: Mean +- std dev: [base] 161 ms +- 2 ms -> [funcs] 167 ms +- 4 ms: 1.04x slower (+4%)
sqlalchemy_imperative: Mean +- std dev: [base] 30.5 ms +- 0.8 ms -> [funcs] 33.1 ms +- 1.3 ms: 1.09x slower (+9%)
sqlite_synth: Mean +- std dev: [base] 2.90 us +- 0.10 us -> [funcs] 3.04 us +- 0.24 us: 1.05x slower (+5%)
sympy_expand: Mean +- std dev: [base] 424 ms +- 5 ms -> [funcs] 454 ms +- 14 ms: 1.07x slower (+7%)
sympy_integrate: Mean +- std dev: [base] 19.5 ms +- 0.1 ms -> [funcs] 20.7 ms +- 0.7 ms: 1.06x slower (+6%)
sympy_sum: Mean +- std dev: [base] 90.7 ms +- 0.8 ms -> [funcs] 96.4 ms +- 3.1 ms: 1.06x slower (+6%)
sympy_str: Mean +- std dev: [base] 185 ms +- 2 ms -> [funcs] 199 ms +- 6 ms: 1.08x slower (+8%)
telco: Mean +- std dev: [base] 5.98 ms +- 0.21 ms -> [funcs] 6.39 ms +- 0.47 ms: 1.07x slower (+7%)
tornado_http: Mean +- std dev: [base] 187 ms +- 4 ms -> [funcs] 193 ms +- 2 ms: 1.03x slower (+3%)
unpack_sequence: Mean +- std dev: [base] 47.9 ns +- 1.5 ns -> [funcs] 51.4 ns +- 1.4 ns: 1.07x slower (+7%)
unpickle_list: Mean +- std dev: [base] 3.73 us +- 0.04 us -> [funcs] 3.77 us +- 0.07 us: 1.01x slower (+1%)
unpickle_pure_python: Mean +- std dev: [base] 371 us +- 7 us -> [funcs] 403 us +- 14 us: 1.09x slower (+9%)
xml_etree_parse: Mean +- std dev: [base] 136 ms +- 3 ms -> [funcs] 142 ms +- 5 ms: 1.04x slower (+4%)
xml_etree_iterparse: Mean +- std dev: [base] 95.6 ms +- 0.9 ms -> [funcs] 101 ms +- 2 ms: 1.06x slower (+6%)
xml_etree_generate: Mean +- std dev: [base] 105 ms +- 4 ms -> [funcs] 112 ms +- 1 ms: 1.07x slower (+7%)
xml_etree_process: Mean +- std dev: [base] 84.3 ms +- 3.6 ms -> [funcs] 89.8 ms +- 1.0 ms: 1.07x slower (+7%)

Benchmark hidden because not significant (3): pickle_list, regex_effbot, unpickle
On 2018-09-13, Neil Schemenauer wrote:
Making Py_TYPE(), Py_INCREF(), Py_DECREF() into inline functions and adding a conditional branch to check for a tag costs roughly 8%.
I've been pondering this result. It seems surprising that such a small amount of new instructions (obviously on a super hot path) would cause such a slow-down. The disassembled code for a call to Py_INCREF in listmodule.c is below:
1ff: 48 8b 1e mov (%rsi),%rbx
if (_Py_IsTaggedPtr(op)) {
202: f6 c3 01 test $0x1,%bl
205: 0f 85 00 00 00 00 jne 20b <PyList_AsTuple+0xdb>
((PyObject *)(op))->ob_refcnt++);
20b: 48 83 03 01 addq $0x1,(%rbx)
The extra instructions that the tagging adds are the "test" and the "jne". I compiled with PGO so branches should be set up to best use likely/unlikely branch paths.
The cycles for the memory write are more difficult to estimate. If we assume the refcnt is in L2 cache on a Haswell processor, the latency is 12 cycles. For L1, 4 cycles.
So, the fact that these extra two instructions add about 8% overhead is an interesting result. I think it means that Py_INCREF and Py_DECREF represent a huge amount of the CPU cycles used by real programs. Making a non-refcount GC has all kinds of challenges. But, there must be a lot of overhead that can be removed by removing INCREF/DECREF.
I really want a revised C-API that allows a non-refcount core GC with refcounting for "handles" passed to extensions. It opens the door to someone in the future making a better GC. With the API we have today, it can't happen without either breaking most C extensions or at least taking a huge performance hit for them.
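Purely as an illustration of the shape such an API could take (none of these names exist):

typedef struct _PyHandle PyHandle;       /* opaque, refcounted wrapper */

PyHandle *PyHandle_New(PyObject *obj);   /* pins obj for the extension */
PyObject *PyHandle_Object(PyHandle *h);  /* the wrapped object */
void PyHandle_Close(PyHandle *h);        /* releases the pin; the core GC
                                            is then free to manage obj */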
Regards,
Neil