Hi Petr,
Thanks for spending time on this.
I think the comparison of the two PEPs falls into two broad categories, performance and capability. I'll address capability first.
Let's try a thought experiment. Consider PEP 580. It uses the old `tp_print` slot as an offset to mark the location of the CCall structure within the callable. Now suppose instead that it uses a `tp_flag` to mark the presence of an offset field and that the offset field is moved to the end of the TypeObject. This would not impact the capabilities of PEP 580.
Now add a single line
    nargs &= ~PY_VECTORCALL_ARGUMENTS_OFFSET
here https://github.com/python/cpython/compare/master...jdemeyer:pep580#diff-1160... which would make PyCCall_FastCall compatible with the PEP 590 vectorcall protocol.
Now rebase the PEP 580 reference code on top of the PEP 590 minimal implementation and make the vectorcall field of CFunction point to PyCCall_FastCall. The resulting hybrid is both a PEP 590 conformant implementation and at least as capable as the reference PEP 580 implementation.
Therefore PEP 590 must be at least as capable as PEP 580.
Now performance.
Currently the PEP 590 implementation is intentionally minimal. It does nothing for performance.
The benchmark Jeroen provides is a micro-benchmark that calls the same functions repeatedly. This is trivial and unrealistic. So, there is no real evidence either way. I will try to provide some.
The point of PEP 590 is that it allows performance improvements by allowing callables more freedom of implementation. To repeat an example from an earlier email, which may have been overlooked, this code reduces the time to create ranges and small lists by about 30%:
https://github.com/markshannon/cpython/compare/vectorcall-minimal...markshan...
https://gist.github.com/markshannon/5cef3a74369391f6ef937d52cca9bfc8
Speeding up calls to builtin functions by a measurable amount will need some work on argument clinic. I plan to have that done before PyCon in May.
Cheers,
Mark.
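
A minimal sketch of what the single-line change in the message above would do, assuming the intended statement is `nargs &= ~PY_VECTORCALL_ARGUMENTS_OFFSET` and assuming the flag occupies the highest bit of a `size_t`. The macro `SKETCH_ARGUMENTS_OFFSET` and the helper `sketch_plain_nargs` are hypothetical stand-ins for illustration, not names from either reference implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PY_VECTORCALL_ARGUMENTS_OFFSET (assumption: the flag is the
   highest bit of a size_t, so it cannot collide with a realistic count). */
#define SKETCH_ARGUMENTS_OFFSET ((size_t)1 << (8 * sizeof(size_t) - 1))

/* Hypothetical helper: returns the positional-argument count with the
   flag bit cleared, leaving every other bit untouched. */
static size_t sketch_plain_nargs(size_t nargs)
{
    nargs &= ~SKETCH_ARGUMENTS_OFFSET;  /* the "single line" from the message */
    assert((nargs & SKETCH_ARGUMENTS_OFFSET) == 0);
    return nargs;
}
```

With this masking in place, a callee that only understands a plain argument count sees the same value whether or not the caller set the flag, which is presumably why one line is enough for compatibility.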
On 2019-04-14 13:34, Mark Shannon wrote:
I'll address capability first.
I don't think that comparing "capability" makes a lot of sense since neither PEP 580 nor PEP 590 adds any new capabilities to CPython. They are meant to allow doing things faster, not to allow more things. And yes, the C call protocol can be implemented on top of the vectorcall protocol and conversely, but that doesn't mean much.
Now performance.
Currently the PEP 590 implementation is intentionally minimal. It does nothing for performance.
So, we're missing some information here. What kind of performance improvements are possible with PEP 590 which are not in the reference implementation?
The benchmark Jeroen provides is a micro-benchmark that calls the same functions repeatedly. This is trivial and unrealistic.
Well, it depends what you want to measure... I'm trying to measure precisely the thing that makes PEP 580 and PEP 590 different from the status-quo, so in that sense those benchmarks are very relevant.
I think that the following 3 statements are objectively true:
(A) Both PEP 580 and PEP 590 add a new calling convention, which is equally fast as builtin functions (and hence faster than tp_call).
(B) Both PEP 580 and PEP 590 keep roughly the same performance as the status-quo for existing function/method calls.
(C) While the performance of PEP 580 and PEP 590 is roughly the same, PEP 580 is slightly faster (based on the reference implementations linked from PEP 580 and PEP 590).
Two caveats concerning (C):
- the difference may be too small to matter. Relatively, it's a few percent of the call time but in absolute numbers, it's less than 10 CPU clock cycles.
- there might be possible improvements to the reference implementation of either PEP 580/PEP 590. I don't expect big differences though.
To repeat an example from an earlier email, which may have been overlooked, this code reduces the time to create ranges and small lists by about 30%
That's just a special case of the general fact (A) above and using the new calling convention for "type". It's an argument in favor of both PEP 580 and PEP 590, not for PEP 590 specifically. Jeroen.
Hi,
On 15/04/2019 9:34 am, Jeroen Demeyer wrote:
On 2019-04-14 13:34, Mark Shannon wrote:
I'll address capability first.
I don't think that comparing "capability" makes a lot of sense since neither PEP 580 nor PEP 590 adds any new capabilities to CPython. They are meant to allow doing things faster, not to allow more things.
And yes, the C call protocol can be implemented on top of the vectorcall protocol and conversely, but that doesn't mean much.
That isn't true. You cannot implement PEP 590 on top of PEP 580. PEP 580 isn't as general. Specifically, and this is important, PEP 580 cannot implement efficient calls to class objects without breaking the ABI.
Now performance.
Currently the PEP 590 implementation is intentionally minimal. It does nothing for performance.
So, we're missing some information here. What kind of performance improvements are possible with PEP 590 which are not in the reference implementation?
Performance improvements include, but aren't limited to:
1. Much faster calls to common classes: range(), set(), type(), list(), etc.
2. Modifying argument clinic to produce C functions compatible with the vectorcall, allowing the interpreter to call the C function directly, with no additional overhead beyond the vectorcall call sequence.
3. Customization of the C code for function objects depending on the Python code. This would probably be limited to treating closures and generator functions differently, but optimizing other aspects of the Python function call is possible.
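
Point 2 above relies on the interpreter being able to call a C function through a pointer stored with the callable, without going through a generic dispatch layer. A much simplified sketch of that idea follows, using plain `long` values instead of Python objects. All `sketch_*` names are hypothetical, and the real protocol differs in its types and in how arguments and keyword names are passed:

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_object sketch_object;

/* Hypothetical call signature: the callable itself, an argument array,
   and the number of arguments in that array. */
typedef long (*sketch_vectorcall)(sketch_object *callable,
                                  const long *args, size_t nargs);

struct sketch_object {
    sketch_vectorcall vectorcall;  /* per-object function pointer */
};

/* Example callee: sums its arguments. */
static long sketch_sum(sketch_object *callable, const long *args, size_t nargs)
{
    (void)callable;  /* unused in this example */
    long total = 0;
    for (size_t i = 0; i < nargs; i++) {
        total += args[i];
    }
    return total;
}

/* The caller's side: one indirect call, with no argument repacking. */
static long sketch_call(sketch_object *callable, const long *args, size_t nargs)
{
    assert(callable->vectorcall != NULL);
    return callable->vectorcall(callable, args, nargs);
}
```

Because the pointer lives with the object rather than being fixed per type, different callables could in principle install differently specialized functions, which is the kind of freedom point 3 alludes to.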
The benchmark Jeroen provides is a micro-benchmark that calls the same functions repeatedly. This is trivial and unrealistic.
Well, it depends what you want to measure... I'm trying to measure precisely the thing that makes PEP 580 and PEP 590 different from the status-quo, so in that sense those benchmarks are very relevant.
I think that the following 3 statements are objectively true:
(A) Both PEP 580 and PEP 590 add a new calling convention, which is equally fast as builtin functions (and hence faster than tp_call).
Yes
(B) Both PEP 580 and PEP 590 keep roughly the same performance as the status-quo for existing function/method calls.
For the minimal implementation of PEP 590, yes. I would expect a small improvement with an implementation of PEP 590 that includes optimizations.
(C) While the performance of PEP 580 and PEP 590 is roughly the same, PEP 580 is slightly faster (based on the reference implementations linked from PEP 580 and PEP 590)
I quite deliberately used the term "minimal" to describe the implementation of PEP 590 you have been using. PEP 590 allows many optimizations. Comparing the performance of the four hundred line minimal diff for PEP 590 with the full four thousand line diff for PEP 580 is misleading.
Two caveats concerning (C):
- the difference may be too small to matter. Relatively, it's a few percent of the call time but in absolute numbers, it's less than 10 CPU clock cycles.
- there might be possible improvements to the reference implementation of either PEP 580/PEP 590. I don't expect big differences though.
To repeat an example from an earlier email, which may have been overlooked, this code reduces the time to create ranges and small lists by about 30%
That's just a special case of the general fact (A) above and using the new calling convention for "type". It's an argument in favor of both PEP 580 and PEP 590, not for PEP 590 specifically.
It very much is an argument in favor of PEP 590. PEP 580 cannot do this. Cheers, Mark.
On 2019-04-27 11:26, Mark Shannon wrote:
Specifically, and this is important, PEP 580 cannot implement efficient calls to class objects without breaking the ABI.
First of all, the layout of PyTypeObject isn't actually part of the stable ABI (see PEP 384). So, we wouldn't be breaking anything by extending PyTypeObject. Second, even if you don't buy this argument and you really think that we should guarantee ABI-compatibility, we can still solve that in PEP 580 by special-casing instances of "type". Sure, that's an annoyance but it's not a fundamental obstruction. Jeroen.
On 2019-04-27 11:26, Mark Shannon wrote:
Performance improvements include, but aren't limited to:
1. Much faster calls to common classes: range(), set(), type(), list(), etc.
That's not specific to PEP 590. It can be done with any proposal. I know that there is the ABI issue with PEP 580, but that's not as big a problem as you seem to think (see my last e-mail).
2. Modifying argument clinic to produce C functions compatible with the vectorcall, allowing the interpreter to call the C function directly, with no additional overhead beyond the vectorcall call sequence.
This is a very good point. Doing this will certainly reduce the overhead of PEP 590 over PEP 580.
3. Customization of the C code for function objects depending on the Python code. This would probably be limited to treating closures and generator functions differently, but optimizing other aspects of the Python function call is possible.
I'm not entirely sure what you mean, but I'm pretty sure that it's not specific to PEP 590. Jeroen.