BDFL-Delegate appointments for several PEPs

Hi folks,

With the revised PEP 1 published, the Steering Council members have been working through the backlog of open PEPs, figuring out which ones are at a stage of maturity where we think it makes sense to appoint a BDFL-Delegate to continue moving the PEP through the review process and eventually make the final decision on whether to accept or reject the change. We'll be announcing those appointments as we go, so I'm happy to report that I will be handling the BDFL-Delegate responsibilities for the following PEPs:

* PEP 499: Binding "-m" executed modules under their module name as well as `__main__`
* PEP 574: Pickle protocol 5 with out-of-band data

I'm also pleased to report that Petr Viktorin has agreed to take on the responsibility of reviewing the competing proposals to improve the way CPython's C API exposes callables for direct invocation by third-party low-level code:

* PEP 576: Exposing the internal FastCallKeywords convention to 3rd party modules
* PEP 580: Revising the callable struct hierarchy internally and in the public C API
* PEP 579: Background information for the problems the other two PEPs are attempting to address

Regards,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi Petr,

Regarding PEPs 576 and 580: over the new year, I did a thorough analysis of possible approaches to calling conventions for use in the CPython ecosystem and came up with a new PEP. The draft can be found here: https://github.com/markshannon/peps/blob/new-calling-convention/pep-9999.rst

I was hoping to profile a branch with the various experimental changes cherry-picked together, but don't seem to have found the time :( I'd like to have a testable branch before formally submitting the PEP, but I thought you should be aware of it.

Cheers,
Mark.

On 24/03/2019 12:21 pm, Nick Coghlan wrote:

On 2019-03-24 16:22, Mark Shannon wrote:
Thanks for that. Is this new PEP meant to supersede PEP 576?
I'd like to have a testable branch before formally submitting the PEP, but I thought you should be aware of it.
If you want to bring up this PEP now during the PEP 576/580 discussion, maybe it's best to formally submit it now? Having an official PEP number might simplify the discussion. If it turns out to be a bad idea after all, you can still withdraw it. In the meantime, I remind you that PEP 576 also doesn't have a complete reference implementation (the PEP links to a "reference implementation" but it doesn't correspond to the text of the PEP). Jeroen.

On 2019-03-24 16:22, Mark Shannon wrote:
The draft can be found here: https://github.com/markshannon/peps/blob/new-calling-convention/pep-9999.rst
I think that this is basically a better version of PEP 576. The idea is the same as PEP 576, but the details are better. Since it's not fundamentally different from PEP 576, I think that this comparison still stands: https://mail.python.org/pipermail/python-dev/2018-July/154238.html

On Sun, Mar 24, 2019 at 4:22 PM Mark Shannon <mark@hotpy.org> wrote:
Hello Mark,

Thank you for letting me know! I wish I had known of this back in January, when you committed the first draft. This is unfair to the competing PEP, which is ready and was waiting for the new governance. We have lost three months that could have been spent pondering the ideas in the pre-PEP.

Do you think you will find the time to piece things together? Is there anything that you already know should be changed? Do you have any comments on [Jeroen's comparison]?

The pre-PEP is simpler than PEP 580, because it solves simpler issues. I'll need to confirm that it won't paint us into a corner -- that there's a way to address all the issues in PEP 579 in the future.

The pre-PEP claims speedups of 2% in initial experiments, with an expected overall performance gain of 4% for the standard benchmark suite. That's pretty big. As far as I can see, PEP 580 claims not much improvement in CPython, but rather large improvements for extensions (Mistune with Cython).

The pre-PEP has a complication around offsetting arguments by 1 to allow bound methods to forward calls cheaply. I fear that this optimizes for current usage with its limitations. PEP 580's cc_parent allows bound methods to have access to the class, and through that, the module object where they are defined and the corresponding module state. To support this, vector calls would need a two-argument offset. (That seems to illustrate the main difference between the motivations of the two PEPs: one focuses on extensibility; the other on optimizing existing use cases.)

The pre-PEP's "any third-party class implementing the new call interface will not be usable as a base class" looks quite limiting.

[Jeroen's comparison]: https://mail.python.org/pipermail/python-dev/2018-July/154238.html

For lack of a better name, I'm using the name PEP 576bis to refer to https://github.com/markshannon/peps/blob/new-calling-convention/pep-9999.rst (This is why this should get a PEP number soon, even if the PEP is not completely done yet). On 2019-03-27 14:50, Petr Viktorin wrote:
One potential issue is calling bound methods (in the duck typing sense) when the LOAD_METHOD optimization is *not* used. This would happen for example when storing a bound method object somewhere and then calling it (possibly repeatedly). Perhaps that's not a very common thing and we should just live with that. However, since __self__ is part of the PEP 580 protocol, it allows calling a bound method object without any performance penalty compared to calling the underlying function directly. Similarly, a follow-up of PEP 580 could allow zero-overhead calling of static/class methods (I didn't put this in PEP 580 because it's already too long).
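The zero-overhead bound-method call that exposing __self__ enables can be sketched with a simplified, hypothetical model in plain C (invented names and types; not the actual PEP 580 API): once the protocol tells the caller where "self" lives, a stored bound-method object can be invoked through the same fast path as the underlying function, with no wrapper call in between.

```c
#include <assert.h>

/* Hypothetical model: a bound-method object that exposes its "self"
 * (playing the role of __self__) and the underlying C function, so a
 * caller can invoke the function directly instead of going through a
 * generic call wrapper. */

typedef int (*fastfunc)(int self_val, int arg);

typedef struct {
    fastfunc func;   /* the underlying C function */
    int self_val;    /* plays the role of __self__ */
} bound_method;

static int add_impl(int self_val, int arg) { return self_val + arg; }

/* Fast path: read __self__ from the bound-method object and call the
 * underlying function directly -- no intermediate wrapper frame. */
static int call_bound_fast(const bound_method *m, int arg) {
    return m->func(m->self_val, arg);
}
```

This only works because __self__ is part of the protocol: without it, a caller holding a stored bound-method object has no portable way to find the underlying function and its receiver.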
As far as I can see, PEP 580 claims not much improvement in CPython, but rather large improvements for extensions (Mistune with Cython).
Cython is indeed the main reason for PEP 580.
The pre-PEP has a complication around offsetting arguments by 1 to allow bound methods to forward calls cheaply.
I honestly don't understand what this "offset by one" means or why it's useful. It should be better explained in the PEP.
The pre-PEP's "any third-party class implementing the new call interface will not be usable as a base class" looks quite limiting.
I agree, this is pretty bad. However, I don't think that there is a need for this limitation. PEP 580 solves this by only inheriting the Py_TPFLAGS_HAVE_CCALL flag in specific cases. PEP 576bis could do something similar. Finally, I don't agree with this sentence from PEP 576bis: PEP 580 is specifically targetted at function-like objects, and doesn't support other callables like classes, partial functions, or proxies. It's true that classes are not supported (and I wonder how PEP 576bis deals with that, it would be good to explain that more explicitly) but other callables are not a problem. Jeroen.

On 2019-03-27 14:50, Petr Viktorin wrote:
I re-did my earlier benchmarks for PEP 580 and these are the results: https://gist.github.com/jdemeyer/f0d63be8f30dc34cc989cd11d43df248 In general, the PEP 580 timings seem slightly better than vanilla CPython, similar to what Mark got. I'm speculating that the speedup in both cases comes from the removal of type checks and dispatching depending on that, and instead using a single protocol that directly does what needs to be done. Jeroen.

Hi Petr, On 27/03/2019 1:50 pm, Petr Viktorin wrote:
I realize this is less than ideal. I had planned to publish this in December, but life intervened. Nothing bad, just too busy.
Do you think you will find the time to piece things together? Is there anything that you already know should be changed?
I've submitted the final PEP and minimal implementation https://github.com/python/peps/pull/960 https://github.com/python/cpython/compare/master...markshannon:vectorcall-mi...
Do you have any comments on [Jeroen's comparison]?
It is rather out of date, but I have two comments. 1. `_PyObject_FastCallKeywords()` is used as an example of a call in CPython. It is an internal implementation detail and not a common path. 2. The claim that PEP 580 allows "certain optimizations because other code can make assumptions" is flawed. In general, the caller cannot make assumptions about the callee or vice versa. Python is a dynamic language.
The pre-PEP is simpler than PEP 580, because it solves simpler issues.
The fundamental issue being addressed is the same, and it is this: Currently third-party C code can either be called quickly or have access to the callable object, not both. Both PEPs address this.
I'll need to confirm that it won't paint us into a corner -- that there's a way to address all the issues in PEP 579 in the future.
PEP 579 is mainly a list of supposed flaws with the 'builtin_function_or_method' class. The general thrust of PEP 579 seems to be that builtin functions and builtin methods should be more flexible and extensible than they are. I don't agree. If you want different behaviour, then use a different object. Don't try to cram all this extra behaviour into a pre-existing object.

However, if we assume that we are talking about callables implemented in C in general, then there are 3 key issues covered by PEP 579:

1. Inspection and documentation; it is hard for extensions to have docstrings and signatures. Worth addressing, but completely orthogonal to PEP 590.
2. Extensibility and performance; extensions should have the power of Python functions without suffering slow calls. Allowing the C code access to the callable object is a general solution to this problem. Both PEP 580 and PEP 590 do this.
3. Exposing the underlying implementation and signature of the C code, so that optimisers can avoid unnecessary boxing. This may be worth doing, but until we have an adaptive optimiser capable of exploiting this information, it is premature. Neither PEP 580 nor PEP 590 explicitly allows or prevents this.
That's because there is a lot of code around calls in CPython, and it has grown in a rather haphazard fashion. Victor's work to add the "FASTCALL" protocol has helped. PEP 590 seeks to formalise and extend that, so that it can be used more consistently and efficiently.
As far as I can see, PEP 580 claims not much improvement in CPython, but rather large improvements for extensions (Mistune with Cython).
Calls to and from extension code are slow because they have to use the `tp_call` calling convention (or lose access to the callable object). With a calling convention that does not have any special cases, extensions can be as fast as builtin functions. Both PEP 580 and PEP 590 attempt to do this, but PEP 590 is more efficient.
It's optimising for the common case, while allowing the less common. Bound methods and classes need to add one additional argument. Other, rarer cases, like `partial`, may need to allocate memory, but can still add or remove any number of arguments.
Not true. The first argument in the vector call is the callable itself. Through it, any callable can access its class, its module, or any other object it wants.
I'll reiterate that PEP 590 is more general than PEP 580 and that once the callable's code has access to the callable object (as both PEPs allow) then anything is possible. You can't get more extensible than that.
The pre-PEP's "any third-party class implementing the new call interface will not be usable as a base class" looks quite limiting.
PEP 580 has the same limitation for the same reasons. The limitation is necessary for correctness if an object supports calls via `__call__` and through another calling convention.
[Jeroen's comparison]: https://mail.python.org/pipermail/python-dev/2018-July/154238.html
Cheers, Mark.

On 2019-03-30 17:30, Mark Shannon wrote:
PEP 580 is meant for extension classes, not Python classes. Extension classes are not dynamic. When you implement tp_call in a given way, the user cannot change it. So if a class implements the C call protocol or the vectorcall protocol, callers can make assumptions about what that means.
I think that there is a misunderstanding here. I fully agree with the "use a different object" solution. This isn't a new solution: it's already possible to implement those different objects (Cython does it). It's just that this solution comes at a performance cost and that's what we want to avoid.
I would argue the opposite: PEP 590 defines a fixed protocol that is not easy to extend. PEP 580 on the other hand uses a new data structure PyCCallDef which could easily be extended in the future (this will intentionally never be part of the stable ABI, so we can do that). I have also argued before that the generality of PEP 590 is a bad thing rather than a good thing: by defining a more rigid protocol as in PEP 580, more optimizations are possible.
I don't think that this limitation is needed in either PEP. As I explained at the top of this email, it can easily be solved by not using the protocol for Python classes. What is wrong with my proposal in PEP 580: https://www.python.org/dev/peps/pep-0580/#inheritance Jeroen.

I added benchmarks for PEP 590: https://gist.github.com/jdemeyer/f0d63be8f30dc34cc989cd11d43df248

Hi, On 01/04/2019 6:31 am, Jeroen Demeyer wrote:
I added benchmarks for PEP 590:
https://gist.github.com/jdemeyer/f0d63be8f30dc34cc989cd11d43df248
Thanks. As expected, for calls to C functions, both PEPs and master perform about the same, as they are using almost the same calling convention under the hood. As an example of the advantage that a general fast calling convention gives you, I have implemented the vectorcall versions of list() and range(): https://github.com/markshannon/cpython/compare/vectorcall-minimal...markshan... This gives a roughly 30% reduction in time for creating ranges, or lists from small tuples: https://gist.github.com/markshannon/5cef3a74369391f6ef937d52cca9bfc8 Cheers, Mark.

On 2019-04-02 21:38, Mark Shannon wrote:
While they are "about the same", in general PEP 580 is slightly faster than master and PEP 590. And PEP 590 actually has a minor slow-down for METH_VARARGS calls. I think that this happens because PEP 580 has fewer levels of indirection than PEP 590. The vectorcall protocol (PEP 590) replaces a slower level (tp_call) with a faster level (vectorcall), while PEP 580 just removes that level entirely: it calls the C function directly. This shows that PEP 580 is really meant to have maximal performance in all cases, accidentally even making existing code faster. Jeroen.

On 3/30/19 11:36 PM, Jeroen Demeyer wrote:
It does seem like there is some misunderstanding. PEP 580 defines a CCall structure, which includes the function pointer, flags, "self" and "parent". Like the current implementation, it has various METH_ flags for various C signatures. When called, the info from CCall is matched up (in relatively complex ways) to what the C function expects. PEP 590 only adds the "vectorcall". It does away with the flags and has only one C signature, which is designed to fit all the existing ones and is well optimized. Storing the "self"/"parent", and making sure they're passed to the C function, is the responsibility of the callable object. There's an optimization for "self" (offsetting using PY_VECTORCALL_ARGUMENTS_OFFSET), and any supporting info can be provided as part of "self".
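The contrast described above can be sketched in plain C (a hypothetical, simplified model with invented names -- not the real CPython structures): the PEP 580 style keeps per-signature flags and dispatches on them, while the PEP 590 style funnels every call through one vector signature.

```c
#include <assert.h>
#include <stddef.h>

/* Invented flags and structs for illustration only. */
#define MODEL_METH_O      1   /* one positional argument */
#define MODEL_METH_NOARGS 2   /* no arguments */

/* PEP-580 style: flags select which C signature to use. */
typedef struct {
    int flags;
    int (*meth_o)(int arg);
    int (*meth_noargs)(void);
} model_ccall;

static int ccall_invoke(const model_ccall *c, const int *args, size_t nargs) {
    if (c->flags == MODEL_METH_O && nargs == 1)
        return c->meth_o(args[0]);
    if (c->flags == MODEL_METH_NOARGS && nargs == 0)
        return c->meth_noargs();
    return -1;  /* signature mismatch */
}

/* PEP-590 style: one signature covers every arity; no flag dispatch. */
typedef int (*model_vectorcall)(const int *args, size_t nargs);

static int square(int x) { return x * x; }
static int seven(void) { return 7; }

static int square_vec(const int *args, size_t nargs) {
    return nargs == 1 ? args[0] * args[0] : -1;
}
```

The trade-off mirrors the thread: the flag-based model matches each C function's natural signature at the cost of a dispatch step, while the single-signature model pushes argument unpacking into the callee.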
Anything is possible, but if one of the possibilities becomes common and useful, PEP 590 would make it hard to optimize for it. Python has grown many "METH_*" signatures over the years as we found more things that need to be passed to callables. Why would "METH_VECTORCALL" be the last? If it won't be (if you think about it as one more way to call functions), then dedicating a tp_* slot to it sounds quite expensive.

In one of the ways to call C functions in PEP 580, the function gets access to:
- the arguments,
- "self", the object,
- the class the method was found in (which is not necessarily type(self)).

I still have to read the details, but when combined with the LOAD_METHOD/CALL_METHOD optimization (avoiding creation of a "bound method" object), it seems impossible to do this efficiently with just the callable's code and the callable's object.
I'll add Jeroen's notes from the review of the proposed PEP 590 (https://github.com/python/peps/pull/960):

- The statement "PEP 580 is specifically targetted at function-like objects, and doesn't support other callables like classes, partial functions, or proxies" is factually false. The motivation for PEP 580 is certainly function/method-like objects, but it's a general protocol that every class can implement. For certain classes, it may not be easy or desirable to do that, but it's always possible.
- Given that `PY_METHOD_DESCRIPTOR` is a flag for tp_flags, shouldn't it be called `Py_TPFLAGS_METHOD_DESCRIPTOR` or something?
- Py_TPFLAGS_HAVE_VECTOR_CALL should be Py_TPFLAGS_HAVE_VECTORCALL, to be consistent with tp_vectorcall_offset and other uses of "vectorcall" (not "vector call").

And mine, so far: I'm not clear on the constness of the "args" array. If it is mutable (PyObject **), you can't, for example, directly pass a tuple's storage (or any other array that could be used in the call). If it is not (PyObject * const *), you can't insert the "self" argument. The reference implementations seem to be inconsistent here. What's the intention?

Hi, On 02/04/2019 1:49 pm, Petr Viktorin wrote:
I doubt METH_VECTORCALL will be the last. Let me give you an example: it is quite common for a function to take two arguments, so we might want to add a METH_OO flag for builtin functions with 2 parameters. To support this in PEP 590, you would make exactly the same change as you would now, which is to add another case to the switch statement in _PyCFunction_FastCallKeywords. For PEP 580, you would add another case to the switch in PyCCall_FastCall. No difference really. PEP 580 uses a slot as well. It's only 8 bytes per class.
It is possible, and relatively straightforward. Why do you think it is impossible?
Thanks for the comments, I'll update the PEP when I get the chance.
I'll make it clearer in the PEP. My thinking was that if `PY_VECTORCALL_ARGUMENTS_OFFSET` is set then the caller is allowing the callee to mutate element -1. It would make sense to generalise that to any element of the vector (including -1). When passing the contents of a tuple, `PY_VECTORCALL_ARGUMENTS_OFFSET` should not be set, and thus the vector could not be mutated. Cheers, Mark.
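The args[-1] convention Mark describes can be sketched in plain C (a simplified, hypothetical model with illustrative names -- not the real CPython API): when the caller sets the offset flag, it guarantees that the slot just before the argument vector is writable scratch space, so a bound-method callee can prepend "self" in place instead of allocating and copying a new argument array.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for PY_VECTORCALL_ARGUMENTS_OFFSET and the
 * nargs mask; the real values and names live in CPython. */
#define MODEL_ARGUMENTS_OFFSET ((size_t)1 << 31)
#define MODEL_NARGS_MASK       (MODEL_ARGUMENTS_OFFSET - 1)

/* The underlying code: sums all its positional arguments. */
static int sum(const int *args, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += args[i];
    return total;
}

/* Bound-method callee: prepend self, then forward the call cheaply. */
static int bound_call(int self, int *args, size_t nargsf) {
    size_t nargs = nargsf & MODEL_NARGS_MASK;
    if (nargsf & MODEL_ARGUMENTS_OFFSET) {
        args[-1] = self;              /* caller allowed mutating slot -1 */
        return sum(args - 1, nargs + 1);
    }
    /* No offset flag: would need to allocate and copy (omitted). */
    return -1;
}
```

When the caller passes the contents of a tuple (whose storage must not be mutated), it simply leaves the flag clear, matching Mark's point that the offset is an opt-in promise from the caller.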

Access to the class isn't possible currently and also not with PEP 590. But it's easy enough to fix that: PEP 573 adds a new METH_METHOD flag to change the signature of the C function (not the vectorcall wrapper). PEP 580 supports this "out of the box" because I'm reusing the class also to do type checks. But this shouldn't be an argument for or against either PEP.

As I'm reading the PEP 590 reference implementation, it strikes me how similar it is to https://bugs.python.org/issue29259 The main difference is that bpo-29259 has a per-class pointer tp_fastcall instead of a per-object pointer. But actually, the PEP 590 reference implementation does not make much use of the per-object pointer: for all classes except "type", the vectorcall wrapper is the same for all objects of a given type. One thing that bpo-29259 did not realize is that existing optimizations could be dropped in favor of using tp_fastcall. For example, bpo-29259 has code like if (PyFunction_Check(callable)) { return _PyFunction_FastCallKeywords(...); } if (PyCFunction_Check(callable)) { return _PyCFunction_FastCallKeywords(...); } else if (PyType_HasFeature(..., Py_TPFLAGS_HAVE_FASTCALL) ...) but the first 2 branches are superfluous given the third. Anyway, this is just putting PEP 590 a bit in perspective. It doesn't say anything about the merits of PEP 590. Jeroen.
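Jeroen's point that the explicit type-check branches become superfluous under a per-type fast-call pointer can be sketched in plain C (a hypothetical toy model, not bpo-29259's actual patch): once every type carries its own pointer, one generic call site replaces the chain of PyFunction_Check/PyCFunction_Check special cases.

```c
#include <assert.h>

/* Toy model of per-type fast-call dispatch. */
typedef struct obj obj;
typedef int (*fastcall_t)(obj *self, int arg);

typedef struct {
    fastcall_t tp_fastcall;   /* per-type, like bpo-29259's tp_fastcall */
} type_t;

struct obj {
    type_t *type;
    int data;
};

/* Each "type" supplies its own specialized implementation. */
static int func_call(obj *self, int arg)  { return self->data + arg; }
static int cfunc_call(obj *self, int arg) { return self->data * arg; }

static type_t function_type  = { func_call };
static type_t cfunction_type = { cfunc_call };

/* Single generic entry point: no per-type branching needed, because
 * the type's pointer already routes to the right implementation. */
static int call_object(obj *o, int arg) {
    return o->type->tp_fastcall(o, arg);
}
```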

On 2019-04-03 07:33, Jeroen Demeyer wrote:
Actually, in the answer above I only considered "is implementing PEP 573 possible?" but I did not consider the complexity of doing that. And in line with what I claimed about complexity before, I think that PEP 580 scores better in this regard. Take PEP 580 and assume for the sake of argument that it didn't already have the cc_parent field. Then adding support for PEP 573 is easy: just add the cc_parent field to the C call protocol structure and set that field when initializing a method_descriptor. C functions can use the METH_DEFARG flag to get access to the PyCCallDef structure, which gives cc_parent. Implementing PEP 573 for a custom function class takes no extra effort: it doesn't require any changes to that class, except for correctly initializing the cc_parent field. Since PEP 580 has built-in support for methods, nothing special needs to be done to support methods too. With PEP 590 on the other hand, every single class which is involved in PEP 573 must be changed and every single vectorcall wrapper supporting PEP 573 must be changed. This is not limited to the function class itself, also the corresponding method class (for example, builtin_function_or_method for method_descriptor) needs to be changed. Jeroen

Hello!

I've had time for a more thorough reading of PEP 590 and the reference implementation. Thank you for the work! Overall, I like PEP 590's direction. I'd now describe the fundamental difference between PEP 580 and PEP 590 as:

- PEP 580 tries to optimize all existing calling conventions.
- PEP 590 tries to optimize (and expose) the most general calling convention (i.e. fastcall).

PEP 580 also does a number of other things, as listed in PEP 579. But I think PEP 590 does not block future PEPs for the other items. On the other hand, PEP 580 has a much more mature implementation -- and that's where it picked up real-world complexity.

PEP 590's METH_VECTORCALL is designed to handle all existing use cases, rather than mirroring the existing METH_* varieties. But both PEPs require the callable's code to be modified, so requiring it to switch calling conventions shouldn't be a problem.

Jeroen's analysis from https://mail.python.org/pipermail/python-dev/2018-July/154238.html seems to miss a step at the top:

a. CALL_FUNCTION* / CALL_METHOD opcode calls
b. _PyObject_FastCallKeywords(), which calls
c. _PyCFunction_FastCallKeywords(), which calls
d. _PyMethodDef_RawFastCallKeywords(), which calls
e. the actual C function (*ml_meth)()

I think it's more useful to say that both PEPs bridge a->e (via _Py_VectorCall or PyCCall_Call).

PEP 590 is built on a simple idea, formalizing fastcall. But it is complicated by PY_VECTORCALL_ARGUMENTS_OFFSET and Py_TPFLAGS_METHOD_DESCRIPTOR. As far as I understand, both are there to avoid an intermediate bound-method object for LOAD_METHOD/CALL_METHOD. (They do try to be general, but I don't see any other use case.) Is that right? (I'm running out of time today, but I'll write more on why I'm asking, and on the case I called "impossible" (while avoiding creation of a "bound method" object), later.)

The way `const` is handled in the function signatures strikes me as too fragile for public API.
I'd like it if, as much as possible, PY_VECTORCALL_ARGUMENTS_OFFSET was treated as a special optimization that extension authors can either opt in to, or blissfully ignore. That might mean:

- vectorcall, PyObject_VectorCallWithCallable, PyObject_VectorCall, and PyCall_MakeTpCall all formally take "PyObject *const *args";
- a naïve callee must do "nargs &= ~PY_VECTORCALL_ARGUMENTS_OFFSET" (maybe spelled as "nargs &= PY_VECTORCALL_NARGS_MASK"), but otherwise writes compiler-enforced const-correct code;
- if PY_VECTORCALL_ARGUMENTS_OFFSET is set, the callee may modify "args[-1]" (and only that, and only after the author has read the docs).

Another point I'd like some discussion on is that the vectorcall function pointer is per-instance. It looks like this is only useful for type objects, but it will add a pointer to every new-style callable object (including functions). That seems wasteful. Why not have a per-type pointer, and for types that need it (like PyTypeObject), make it dispatch to an instance-specific function?

Minor things:

- "Continued prohibition of callable classes as base classes" -- this section reads as final. Would you be OK wording this as something other PEPs can tackle?
- "PyObject_VectorCall" -- this looks extraneous, and the reference implementation doesn't need it so far. Can it be removed, or justified?
- METH_VECTORCALL is *not* strictly "equivalent to the currently undocumented METH_FASTCALL | METH_KEYWORD flags" (it has the ARGUMENTS_OFFSET complication).
- I'd like to officially call this PEP "Vectorcall", see https://github.com/python/peps/pull/984

Mark, what are your plans for next steps with PEP 590? If a volunteer wanted to help you push this forward, what would be the best thing to work on?

Jeroen, is there something in PEPs 579/580 that PEP 590 blocks, or should address?
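The naïve-callee convention Petr proposes can be sketched in plain C (a simplified, hypothetical model with invented names standing in for the real flag and mask): an oblivious callee takes a const argument array, strips the flag bit out of the nargs word, and never touches args[-1], so it stays const-correct while remaining compatible with callers that set the offset flag.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for PY_VECTORCALL_ARGUMENTS_OFFSET and the
 * proposed PY_VECTORCALL_NARGS_MASK. */
#define MODEL_ARGUMENTS_OFFSET ((size_t)1 << 31)
#define MODEL_NARGS_MASK       (MODEL_ARGUMENTS_OFFSET - 1)

/* Naive, const-correct callee: blissfully ignores the offset
 * optimization. The only obligation is masking the flag bit. */
static int naive_sum(const int *args, size_t nargsf) {
    size_t nargs = nargsf & MODEL_NARGS_MASK;   /* strip the flag bit */
    int total = 0;
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}
```

Only a callee that opts in to the optimization would cast away const and write to args[-1]; everyone else gets compiler-enforced safety, which is the point of the proposal.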

On 2019-04-10 18:25, Petr Viktorin wrote:
And thank you for the review!
And PEP 580 has better performance overall, even for METH_FASTCALL. See this thread: https://mail.python.org/pipermail/python-dev/2019-April/156954.html Since these PEPs are all about performance, I consider this a very relevant argument in favor of PEP 580.
I claim that the complexity in the protocol of PEP 580 is a good thing, as it removes complexity from other places, in particular from the users of the protocol (better to have a complex protocol that's simple to use than a simple protocol that's complex to use). As a more concrete example of the simplicity that PEP 580 could bring: CPython currently has 2 classes for bound methods implemented in C:

- "builtin_function_or_method" for normal C methods
- "method-descriptor" for slot wrappers like __eq__ or __add__

With PEP 590, these classes would need to stay separate to get maximal performance. With PEP 580, just one class for bound methods would be sufficient and there wouldn't be any performance loss. And this extends to custom third-party function/method classes, for example as implemented by Cython.
Agreed.
Not quite. For a builtin_function_or_method, with PEP 580 we have:

a. call_function(), which calls
d. PyCCall_FastCall, which calls
e. the actual C function

and with PEP 590 it's more like:

a. call_function(), which calls
c. _PyCFunction_FastCallKeywords, which calls
d. _PyMethodDef_RawFastCallKeywords, which calls
e. the actual C function

Level c. above is the vectorcall wrapper, which is a level that PEP 580 doesn't have.
The way `const` is handled in the function signatures strikes me as too fragile for public API.
That's a detail which shouldn't influence the acceptance of either PEP.
Why not have a per-type pointer, and for types that need it (like PyTypeObject), make it dispatch to an instance-specific function?
That would be exactly https://bugs.python.org/issue29259 I'll let Mark comment on this.
Those are indeed details which shouldn't influence the acceptance of either PEP. If you go with PEP 590, then we should discuss this further.
Personally, I think what we need now is a decision between PEP 580 and PEP 590 (there is still the possibility of rejecting both, but I really hope that this won't happen). There is a lot of work that still needs to be done after either PEP is accepted, such as:

- finish and merge the reference implementation
- document everything
- use the protocol in more classes where it makes sense (for example, staticmethod, wrapper_descriptor)
- use this in Cython
- handle more issues from PEP 579

I volunteer to put my time into this, regardless of which PEP is accepted. Of course, I still think that PEP 580 is better, but I also want this functionality even if PEP 590 is accepted.
Jeroen, is there something in PEPs 579/580 that PEP 590 blocks, or should address?
Well, PEP 580 is an extensible protocol while PEP 590 is not. But, PyTypeObject is extensible, so even with PEP 590 one can always extend that (for example, PEP 590 uses a type flag Py_TPFLAGS_METHOD_DESCRIPTOR where PEP 580 instead uses the structs for the C call protocol). But I guess that extending PyTypeObject will be harder to justify (say, in a future PEP) than extending the C call protocol. Also, it's explicitly allowed for users of the PEP 580 protocol to extend the PyCCallDef structure with custom fields. But I don't have a concrete idea of whether that will be useful. Kind regards, Jeroen.

On 4/11/19 1:05 AM, Jeroen Demeyer wrote:
One general note: I am not (yet) choosing between PEP 580 and PEP 590. I am not looking for arguments for/against whole PEPs, but individual ideas which, I believe, can still be mixed & matched. I see the situation this way:

- I get about one day per week when I can properly concentrate on CPython. It's frustrating to be the bottleneck.
- Jeroen has time, but it would be frustrating to work on something that will later be discarded, and it's frustrating to not be able to move the project forward.
- Mark has good ideas, but seems to lack the time to polish them, or even test whether they are good. It is probably frustrating to see unpolished ideas rejected.

I'm looking for ways to reduce the frustration, given where we are.

Jeroen, thank you for the comments. Apologies for not having the time to reply to all of them properly right now. Mark, if you could find the time to answer (even just a few of the points), it would be great. I ask you to share/clarify your thoughts, not defend your PEP.
Sadly, I need more time on this than I have today; I'll get back to it next week.
Again, I'll get back to this next week.
True. I guess what I want from the answer is to know how much thought went into const handling: is what's in the PEP an initial draft, or does it solve some hidden issue?
Here again, I mostly want to know if the details are there for deeper reasons, or just points to polish.
Thank you. Sorry for the way this is dragged out. Would it help to set some timeline/deadlines here?
Thanks. I also like PEP 580's extensibility.
I don't have good general experience with premature extensibility, so I'd not count this as a plus.

Petr, I realize that you are in a difficult position. You'll end up disappointing either me or Mark... I don't know if the steering council or somebody else has a good idea to deal with this situation.
Jeroen has time
Speaking of time, maybe I should clarify that I have time until the end of August: I am working for the OpenDreamKit grant, which allows me to work basically full-time on open source software development but that ends at the end of August.
Here again, I mostly want to know if the details are there for deeper reasons, or just points to polish.
I would say: mostly shallow details. The subclassing thing would be good to resolve, but I don't see any difference between PEP 580 and PEP 590 there. In PEP 580, I wrote a strategy for dealing with subclassing. I believe that it works and that exactly the same idea would work for PEP 590 too. Of course, I may be overlooking something...
I don't have good general experience with premature extensibility, so I'd not count this as a plus.
Fair enough. I also see it more as a "nice to have", not as a big plus.

On Thu, Apr 11, 2019 at 5:06 AM Jeroen Demeyer <J.Demeyer@ugent.be> wrote:
Our answer was "ask Petr to be BDFL Delegate". ;) In all seriousness, none of us on the council are as well equipped as Petr to handle this tough decision; it would take even longer for us to learn enough to make an informed decision, and we would be even worse off. -Brett

On 4/10/19 7:05 PM, Jeroen Demeyer wrote:
All about performance as well as simplicity, correctness, testability, teachability... And PEP 580 touches some introspection :)
I think we're talking past each other. I see now it as: PEP 580 takes existing complexity and makes it available to all users, in a simpler way. It makes existing code faster. PEP 590 defines a new simple/fast protocol for its users, and instead of making existing complexity faster and easier to use, it's left to be deprecated/phased out (or kept in existing classes for backwards compatibility). It makes it possible for future code to be faster/simpler. I think things should be simple by default, but if people want some extra performance, they can opt in to some extra complexity.
Yet, for backwards compatibility reasons, we can't merge the classes. Also, I think CPython and Cython are exactly the users that can trade some extra complexity for better performance.
PEP 580 optimizes all the code paths, where PEP 590 optimizes the fast path, and makes sure most/all use cases can use (or switch to) the fast path. Both fast paths are fast: bridging a->e using zero-copy arg passing with some C calls and flag checks. The PEP 580 approach is faster; PEP 590's is simpler.
That's a good point.
Unless I'm missing something, that would be effectively the same as extending their own instance struct. To bring any benefits, the extended PyCCallDef would need to be standardized in a PEP.

On 2019-04-25 00:24, Petr Viktorin wrote:
Can you elaborate on what you mean with this deprecating/phasing out? What's your view on dealing with method classes (not necessarily right now, but in the future)? Do you think that having separate method classes like method-wrapper (for example [].__add__) is good or bad? Since the way how PEP 580 and PEP 590 deal with bound method classes is very different, I would like to know the roadmap for this. Jeroen.

On 4/25/19 5:12 AM, Jeroen Demeyer wrote:
Kept for backwards compatibility, but not actively recommended or optimized. Perhaps made slower if that would help performance elsewhere.
I fully agree with PEP 579's point on complexity:
There are a huge number of classes involved to implement all variations of methods. This is not a problem by itself, but a compounding issue.
The main problem is that, currently, you sometimes need to care about this (due to CPython special-casing its own classes, without falling back to some public API). Ideally, what matters is the protocols the class implements rather than the class itself. If that is solved, having so many different classes becomes curious but unimportant -- merging them shouldn't be a priority. I'd concentrate on two efforts instead:

- Calling should have a fast public API. (That's this PEP.)
- Introspection should have a well-defined, consistently used public API (but not necessarily a fast one).

For introspection, I think the way is implementing the necessary API (e.g. dunder attributes) and changing things like inspect, traceback generation, etc. to use them. CPython's callable classes should stay as internal implementation details. (Specifically: I'm against making them subclassable: allowing subclasses basically makes everything about the superclass an API.)
Since the way how PEP 580 and PEP 590 deal with bound method classes is very different, I would like to know the roadmap for this.
My thoughts are not the roadmap, of course :) Speaking about roadmaps, I often use PEP 579 to check what I'm forgetting. Here are my thoughts on it:

## Naming (The word "built-in" is overused in Python)

This is a social/docs problem, and out of scope of the technical efforts. PEPs should always define the terms they use (even in the case where there is an official definition, but it doesn't match popular usage).

## Not extendable

As I mentioned above, I'm against opening the callables for subclassing. We should define and use protocols instead.

## cfunctions do not become methods

If we were designing Python from scratch, this should have been done differently. Now this is a problem for Cython to solve. CPython should provide the tools to do so.

## Semantics of inspect.isfunction

I don't like inspect.isfunction, because "Is it a function?" is almost never what you actually want to ask. I'd like to deprecate it in favor of explicit functions like "Does it have source code?", "Is it callable?", or even "Is it exactly types.FunctionType?". But I'm against changing its behavior -- people are expecting the current answer.

## C functions should have access to the function object

That's where my stake in all this is; I want to move on with PEP 573 after 580/590 is sorted out.

## METH_FASTCALL is private and undocumented

This is the intersection of PEP 580 and 590.

## Allowing native C arguments

This would be a very experimental feature. Argument Clinic itself is not intended for public use, and locking its "impl" functions as part of the public API is off the table at this point. Cython's cpdef allows this nicely, and CPython's API is full of C functions. That should be good enough for now.

## Complexity

We should simplify, but I think the number of callable classes is not the best metric to focus on.

## PyMethodDef is too limited

This is a valid point. But the PyMethodDef array is little more than a shortcut to creating methods directly in a loop.
The immediate workaround could be to create a new constructor for methods. Then we can look into expressing the data declaratively again.

## Slot wrappers have no custom documentation

I think this can now be done with a new custom slot wrapper class. Perhaps that can be added to CPython when it matures.

## Static methods and class methods should be callable

This is a valid, though minor, point. I don't even think it would be a PEP-level change.

On 2019-04-25 23:11, Petr Viktorin wrote:
My thoughts are not the roadmap, of course :)
I asked about methods because we should be aware of the consequences when choosing between PEP 580 and PEP 590 (or some compromise). There are basically 3 different ways of dealing with bound methods:

(A) put methods inside the protocol. This is PEP 580 and my 580/590 compromise proposal. The disadvantage here is complexity in the protocol.

(B) don't put methods inside the protocol and use a single generic method class types.MethodType. This is the status quo for Python functions. It has the disadvantage of being slightly slower: there is an additional level of indirection when calling a bound method object.

(C) don't put methods inside the protocol but use multiple method classes, one for every function class. This is the status quo for functions implemented in C. This has the disadvantage of code duplication.

I think that the choice between PEP 580 and 590 should be made together with a choice of one of the above options. For example, I really don't like the code duplication of (C), so I would prefer PEP 590 with (B) over PEP 590 with (C).

Hi Petr, On 24/04/2019 11:24 pm, Petr Viktorin wrote:
Why do you say that PEP 580's approach is faster? There is no evidence for this. The only evidence so far is a couple of contrived benchmarks. Jeroen's showed a ~1% speedup for PEP 580 and mine showed a ~30% speedup for PEP 590. This clearly shows that I am better at coming up with contrived benchmarks :) PEP 590 was chosen as the fastest protocol I could come up with that was fully general, and wasn't so complex as to be unusable.
Saying that PEP 590 is not extensible is true, but misleading. PEP 590 is fully universal, it supports callables that can do anything with anything. There is no need for it to be extended because it already supports any possible behaviour. Cheers, Mark.

Hi, Petr On 10/04/2019 5:25 pm, Petr Viktorin wrote:
Not quite. Py_TPFLAGS_METHOD_DESCRIPTOR is for LOAD_METHOD/CALL_METHOD; it allows any callable descriptor to benefit from the LOAD_METHOD/CALL_METHOD optimisation. PY_VECTORCALL_ARGUMENTS_OFFSET exists so that callables that make onward calls with an additional argument can do so efficiently. The obvious example is bound methods, but classes are at least as important: cls(*args) -> cls.__new__(cls, *args) -> cls.__init__(self, *args)
The updated minimal implementation now uses `const` arguments. Code that uses args[-1] must explicitly cast away the const. https://github.com/markshannon/cpython/blob/vectorcall-minimal/Objects/class...
Firstly, each callable has different behaviour, so it makes sense to be able to do the dispatch from caller to callee in one step. Having a per-object function pointer allows that. Secondly, callables are either large or transient. If large, then the extra few bytes make little difference. If transient, then it matters even less. The total increase in memory is likely to be only a few tens of kilobytes, even for a large program.
Yes, removing it makes sense. I can then rename the clumsily named "PyObject_VectorCallWithCallable" as "PyObject_VectorCall".
METH_VECTORCALL is just making METH_FASTCALL | METH_KEYWORD documented and public. Would you prefer that it has a different name, to prevent confusion with PY_VECTORCALL_ARGUMENTS_OFFSET? I don't like calling things "fast" or "new" as the names can easily become misleading. New College, Oxford is over 600 years old. Not so "new" any more :)
The minimal implementation is also a complete implementation. Third party code can start using the vectorcall protocol immediately and be called efficiently from the interpreter. I think it is very close to being mergeable. To gain the promised performance improvements is obviously a lot more work, but that can be done incrementally over the next few months. Cheers, Mark.

Hi Jeroen, On 15/04/2019 9:38 am, Jeroen Demeyer wrote:
Here's some (untested) code for an implementation of vectorcall for object subtypes implemented in Python. It uses PY_VECTORCALL_ARGUMENTS_OFFSET to save memory allocation when calling the __init__ method. https://github.com/python/cpython/commit/9ff46e3ba0747f386f9519933910d63d5ca... Cheers, Mark.

So, I spent another day pondering the PEPs. I love PEP 590's simplicity and PEP 580's extensibility. As I hinted before, I hope they can be combined, and I believe we can achieve that by having PEP 590's (o+offset) point not just to a function pointer, but to a {function pointer; flags} struct with flags defined for two optimizations:

- "Method-like", i.e. compatible with LOAD_METHOD/CALL_METHOD.
- "Argument offsetting request", allowing PEP 590's PY_VECTORCALL_ARGUMENTS_OFFSET optimization.

This would mean one basic call signature (today's METH_FASTCALL | METH_KEYWORDS), with individual optimizations available if both the caller and callee support them.

In case you want to know my thoughts or details, let me indulge in some detailed comparisons and commentary that led to this. I also give a more detailed proposal below. Keep in mind I wrote this before I distilled it to the paragraph above, and though the distillation is written as a diff to PEP 590, I still think of this as merging both PEPs.

PEP 580 tries hard to work with existing call conventions (like METH_O, METH_VARARGS), making them fast. PEP 590 just defines a new convention. Basically, any callable that wants performance improvements must switch to METH_VECTORCALL (fastcall). I believe PEP 590's approach is OK. To stay as performant as possible, C extension authors will need to adapt their code regularly. If they don't, no harm -- the code will still work as before, and will still be about as fast as it was before. In exchange for this, Python (and Cython, etc.) can focus on optimizing one calling convention, rather than a variety, each with its own advantages and drawbacks.

Extending PEP 580 to support a new calling convention will involve defining a new CCALL_* constant, and adding to existing dispatch code. Extending PEP 590 to support a new calling convention will most likely require a new type flag, and either changing the vectorcall semantics or adding a new pointer.
To be a bit more concrete, I think of possible extensions to PEP 590 as things like:

- Accepting a kwarg dict directly, without copying the items to tuple/array (as in PEP 580's CCALL_VARARGS|CCALL_KEYWORDS)
- Prepending more than one positional argument, or appending positional arguments
- When an optimization like LOAD_METHOD/CALL_METHOD turns out to no longer be relevant, removing it to simplify/speed up code.

I expect we'll later find out that something along these lines might improve performance. PEP 590 would make it hard to experiment.

I mentally split PEP 590 into two pieces: formalizing fastcall, plus one major "extension" -- making bound methods fast. When seen this way, this "extension" is quite heavy: it adds an additional type flag, Py_TPFLAGS_METHOD_DESCRIPTOR, and uses a bit in the "Py_ssize_t nargs" argument as an additional flag. Both type flags and nargs bits are very limited resources. If I were sure vectorcall is the final best implementation we'll have, I'd go and approve it -- but I think we still need room for experimentation, in the form of more such extensions.

PEP 580, with its collection of per-instance data and flags, is definitely more extensible. What I don't like about it is that it has the extensions built in: mandatory for all callers/callees. PEP 580 adds a common data struct to callable instances. Currently it holds all the data bound methods want to use (cc_flags, cc_func, cc_parent, cr_self). Various flags are consulted in order to deliver the needed info to the underlying function.

PEP 590 lets the callable object store the data it needs independently. It provides a clever mechanism for pre-allocating space for bound methods' prepended "self" argument, so the data can be provided cheaply, though it's still done by the callable itself. Callables that would need to e.g. prepend more than one argument won't be able to use this mechanism, but y'all convinced me that is not worth optimizing for.
PEP 580's goal seems to be that making a callable behave like a Python function/method is just a matter of the right set of flags. Jeroen called this "complexity in the protocol". PEP 590, on the other hand, leaves much to individual callable types. This is "complexity in the users of the protocol". I now don't see a problem with PEP 590's approach. Not all users will need the complexity. We need to give CPython and Cython the tools to make implementing "def"-like functions possible (and fast), but if other extensions need to match the behavior of Python functions, they should just use Cython. Emulating Python functions is a special-enough use case that it doesn't justify complicating the protocol, and the same goes for implementing Python's built-in functions (with all their historical baggage).

My fuller proposal for a compromise between PEP 580 and 590 would go something like below. The type flag (Py_TPFLAGS_HAVE_VECTORCALL/Py_TPFLAGS_HAVE_CCALL) and offset (tp_vectorcall_offset/tp_ccalloffset; in tp_print's place) stay. The offset identifies a per-instance structure with two fields:

- Function pointer (with the vectorcall signature)
- Flags

Storing any other per-instance data (like PEP 580's cr_self/cc_parent) is the responsibility of each callable type. Two flags are defined initially:

1. "Method-like" (like Py_TPFLAGS_METHOD_DESCRIPTOR in PEP 590, or non-NULL cr_self in PEP 580). Having the flag here instead of a type flag will prevent tp_call-only callables from taking advantage of the LOAD_METHOD/CALL_METHOD optimisation, but I think that's OK.

2. Request to reserve space for one argument before the args array, as in PEP 590's argument offsetting. If the flag is missing, nargs may not include PY_VECTORCALL_ARGUMENTS_OFFSET. A mechanism incompatible with offsetting may use the bit for another purpose.
Both flags may be simply ignored by the caller (or not be set by the callee in the first place), reverting to a more straightforward (but less performant) code path. This should also be the case for any flags added in the future. Note how without these flags, the protocol (and its documentation) will be extremely simple. This mechanism would work with my examples of possible future extensions:

- "kwarg dict": A flag would enable the `kwnames` argument to be a dict instead of a tuple.
- Prepending/appending several positional arguments: The callable's request for how much space to allocate is stored right after the {func; flags} struct. As in argument offsetting, a bit in nargs would indicate that the request was honored. (If this was made incompatible with one-arg offsetting, it could reuse the bit.)
- Removing an optimization: CPython would simply stop using an optimization (but not remove the flag). Extensions could continue to use the optimization between themselves.

As in PEP 590, any class that uses this mechanism shall not be usable as a base class. This will simplify implementation and tests, but hopefully the limitation will be removed in the future. (Maybe even in the initial implementation.) The METH_VECTORCALL (aka CCALL_FASTCALL|CCALL_KEYWORDS) calling convention is added to the public API. The other calling conventions (PEP 580's CCALL_O, CCALL_NOARGS, CCALL_VARARGS, CCALL_KEYWORDS, CCALL_FASTCALL, CCALL_DEFARG) as well as argument type checking (CCALL_OBJCLASS) and self slicing (CCALL_SELFARG) are left up to the callable.

No equivalent of PEP 580's restrictions on the __name__ attribute. In my opinion, the PyEval_GetFuncName function should just be deprecated in favor of getting the __name__ attribute and checking if it's a string. It would be possible to add a public helper that returns a proper reference, but that doesn't seem worth it. Either way, I consider this out of scope of this PEP.
No equivalent of PEP 580's PyCCall_GenericGetParent and PyCCall_GenericGetQualname either -- again, if needed, they should be retrieved as normal attributes. As I see it, the operation doesn't need to be particularly fast.

No equivalent of PEP 580's PyCCall_Call, and no support for dict in PyCCall_FastCall's kwds argument. To be fast, extensions should avoid passing kwargs in a dict. Let's see how far that takes us. (FWIW, this also avoids subtle issues with dict mutability.)

Profiling stays as in PEP 580: only exact function types generate the events. As in PEP 580, PyCFunction_GetFlags and PyCFunction_GET_FLAGS are deprecated. As in PEP 580, nothing is added to the stable ABI.

Does that sound reasonable?

On 2019-04-25 00:24, Petr Viktorin wrote:
What's the rationale for putting the flags in the instance? Do you expect flags to be different between one instance and another instance of the same class?
Both type flags and nargs bits are very limited resources.
Type flags are only a limited resource if you think that all flags ever added to a type must be put into tp_flags. There is nothing wrong with adding new fields such as tp_extraflags or tp_vectorcall_flags to a type.
What I don't like about it is that it has the extensions built-in; mandatory for all callers/callees.
I don't agree with the above sentence about PEP 580: - callers should use APIs like PyCCall_FastCall() and shouldn't need to worry about the implementation details at all. - callees can opt out of all the extensions by not setting any special flags and setting cr_self to a non-NULL value. When using the flags CCALL_FASTCALL | CCALL_KEYWORDS, then implementing the callee is exactly the same as PEP 590.
As in PEP 590, any class that uses this mechanism shall not be usable as a base class.
Can we please lift this restriction? There is really no reason for it. I'm not aware of any similar restriction anywhere in CPython. Note that allowing subclassing is not the same as inheriting the protocol. As a compromise, we could simply never inherit the protocol. Jeroen.

On 4/25/19 10:42 AM, Jeroen Demeyer wrote:
I'm not tied to that idea. If there's a more reasonable place to put the flags, let's go for it; it's not a big enough issue to complicate the protocol over. Quoting Mark from the other subthread:
Callables are either large or transient. If large, then the extra few bytes make little difference. If transient, then it matters even less.
Indeed. Extra flags are just what I think PEP 590 is missing.
Imagine an extension author sitting down to read the docs and implement a callable:

- PEP 580 introduces 6 CCALL_* combinations: you need to select the best one for your use case. Also, add two structs to the instance & link them via pointers, and make sure you support descriptor behavior and the __name__ attribute. (Plus there are features for special purposes: CCALL_DEFARG, CCALL_OBJCLASS, self-slicing, but you can skip those initially.)
- My proposal: to the instance, add a function pointer with a known signature and flags which you set to zero. Add an offset to the type, and set a type flag. (There are additional possible optimizations, but you can skip them initially.)

PEP 580 makes a lot of sense if you read it all, but I fear there'll be very few people who read and understand it. And it is not important just for extension authors (admittedly, implementing a callable directly using the C API is often a bad idea). The more people understand the mechanism, the more people can help with further improvements.

I don't see the benefit of supporting the METH_VARARGS, METH_NOARGS, and METH_O calling conventions (beyond backwards compatibility and compatibility with Python's *args syntax). For keywords, I see a benefit in supporting *only one* of kwarg dict or kwarg tuple: if the caller and callee don't agree on which one to use, you need an expensive conversion. If we say tuple is the way, some of them will need to adapt, but within the set of those that do, any caller/callee combination will be fast. (And if tuple-only turns out to be the wrong choice, adding dict support in the future shouldn't be hard.)

That leaves fastcall (with tuple only) as the focus of this PEP, and the other calling conventions essentially as implementation details of builtin functions/methods.
Sure, let's use PEP 580 treatment of inheritance. Even if we don't, I don't think dropping this restriction would be a PEP-level change. It can be dropped as soon as an implementation and tests are ready, and inheritance issues ironed out. But it doesn't need to be in the initial implementation.
As a compromise, we could simply never inherit the protocol.
That also sounds reasonable for the initial implementation.

Hello, after reading the various comments and thinking about it more, let me propose a real compromise between PEP 580 and PEP 590. My proposal is: take the general framework of PEP 580 but support only a single calling convention like PEP 590. The single calling convention supported would be what is currently specified by the flag combination CCALL_DEFARG|CCALL_FASTCALL|CCALL_KEYWORDS. This way, the flags CCALL_VARARGS, CCALL_FASTCALL, CCALL_O, CCALL_NOARGS, CCALL_KEYWORDS, CCALL_DEFARG can be dropped. This calling convention is very similar to the calling convention of PEP 590, except that:

- the callable is replaced by a pointer to a PyCCallDef (the structure from PEP 580, but possibly without cc_parent)
- there is a self argument like PEP 580. This implies support for the CCALL_SELFARG flag from PEP 580 and no support for the PY_VECTORCALL_ARGUMENTS_OFFSET trick of PEP 590.

Background: I added support for all those calling conventions in PEP 580 because I didn't want to make any compromise regarding performance. When writing PEP 580, I assumed that any kind of performance regression would be a reason to reject PEP 580. However, it seems now that you're willing to accept PEP 590 instead, which does introduce performance regressions in certain code paths. That suggests that we could keep the good parts of PEP 580 but reduce its complexity by having a single calling convention like PEP 590.

If you compare this compromise to PEP 590, the main difference is in dealing with bound methods. Personally, I really like the idea of having a *single* bound method class which would be used by all kinds of function classes without any loss of performance (not only in CPython itself, but also by Cython and other C extensions). To support that, we need something like the PyCCallRoot structure from PEP 580, together with the special handling for self.
About cc_parent and CCALL_OBJCLASS: I prefer to keep those because they allow merging the classes for bare functions (not inside a class) and unbound methods (functions inside a class). Concretely, that could reduce code duplication between builtin_function_or_method and method_descriptor. But I'm also fine with removing cc_parent and CCALL_OBJCLASS. In any case, we can decide that later. What do you think? Jeroen.

Hello! Sorry for the delay; PyCon is keeping me busy. On the other hand, I did get to talk to a lot of smart people here! I'm leaning toward accepting PEP 590 (with some changes still). Let's start focusing on it. As for the changes, I have these 4 points:

1. I feel that the API needs some contact with real users before it's set in stone. That was the motivation behind my proposal for PEP 590 with additional flags. At PyCon, Nick Coghlan suggested another option: make the API "provisional" by making it formally private. Py_TPFLAGS_HAVE_VECTORCALL would be underscore-prefixed, and the docs would say that it can change. In Python 3.9, the semantics will be finalized and the underscore removed. This would allow high-maintenance projects (like Cython) to start using it and give their feedback, and we'd have a chance to respond to the feedback.

2. tp_vectorcall_offset should be what's replacing tp_print in the struct. The current implementation has tp_vectorcall there. This way, Cython can create vectorcall callables for older Pythons. (See PEP 580: https://www.python.org/dev/peps/pep-0580/#replacing-tp-print)

3. Subclassing should not be forbidden. Jeroen, do you want to write a section for how subclassing should work?

4. Given Jeroen's research and ideas that went into the PEP (and hopefully, we'll incorporate some PEP 580 text as well), it seems fair to list him as co-author of the accepted PEP, instead of just listing PEP 580 in the acknowledgement section.

On some other points:

- Single bound method class for all kinds of function classes: This would be a cleaner design, yes, but I don't see a pressing need. As PEP 579 says, "this is a compounding issue", not a goal. As I recall, that is the only major reason for CCALL_DEFARG. PEP 590 says that x64 Windows passes 4 arguments in registers. Admittedly, I haven't checked this, nor the performance implications (so this would be a good point to argue!), but it seems like a good reason to keep the argument count down. So, no CCALL_DEFARG.
- In reply to this Mark's note:
PEP 590 is fully universal, it supports callables that can do anything with anything. There is no need for it to be extended because it already supports any possible behaviour.
I don't buy this point. The current tp_call also supports any possible behavior. Here we want to support any behavior *efficiently*. As a specific example: to call a PEP 590 callable with a kwarg dict, there'll need to be an extra allocation. That's inefficient relative to PEP 580 (or PEP 590 plus allowing a dict in "kwnames"). But I'm willing to believe the inefficiency is acceptable.

On 2019-05-06 00:04, Petr Viktorin wrote:
Just a minor correction here: I guess that you mean CCALL_SELFARG. The flag CCALL_DEFARG is for passing the PyCCallDef* in PEP 580, which is mostly equivalent to passing the callable object in PEP 590. The signature of PEP 580 is

    func(const PyCCallDef *def, PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

and with PEP 590 it is

    func(PyObject *callable, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

with the additional special role for the PY_VECTORCALL_ARGUMENTS_OFFSET bit (which is meant to solve the problem of "self" in a different way).

On 5/6/19 3:43 AM, Jeroen Demeyer wrote:
I worded that badly, sorry. From PEP 590's `callable`, the called function can get any of these if it needs to (and if they're stored somewhere). But you can't write generic code that would get them from any callable. If we're not going for the "single bound method class" idea, that is OK; `def` & `self` can be implementation details of the callables that need them.

Hello Petr, Thanks for your time. I suggest that you (or somebody else) officially reject PEP 580. I'll start working on reformulating PEP 590, adding some elements from PEP 580, and at the same time on the implementation of PEP 590. I want to implement Mark's idea of having a separate wrapper for each old-style calling convention. In the meantime, we can continue the discussion about the details, for example whether to store the flags inside the instance (I don't have an answer for that right now; I'll need to think about it). Petr, did you discuss with the Steering Council? It would be good to have some kind of pre-approval that PEP 590 and its implementation will be accepted. I want to work on PEP 590, but I'm not the right person to "defend" it (I know that it's worse in some ways than PEP 580). Jeroen.

On 5/6/19 4:24 AM, Jeroen Demeyer wrote:
I'll do that shortly. I hope that you are not taking this personally. PEP 580 is a good design. PEP 590 even says that it's built on your ideas.
I'm abandoning the per-instance flag proposal. It's an unnecessary complication; per-type flags are fine.
As BDFL-delegate, I'm "pre-approving" PEP 590. I mentioned some details of PEP 590 that still need attention. If there are any more, now's the time to bring them up. And yes, I know that in some ways it's worse than PEP 580. That's what makes it a hard decision.

PEP 590 is on its way to be accepted, with some details still to be discussed. I've rejected PEP 580 so we can focus on one proposal. Here are things we discussed on GitHub but now seem to agree on:

* The vectorcall function's kwnames argument can be NULL.
* Let's use `vectorcallfunc`, not `vectorcall`, and stop the bikeshedding.
* `tp_vectorcall_offset` can be `Py_ssize_t` (The discussions around signedness and C standards and consistency are interesting, but ultimately irrelevant here.)
* `PyCall_MakeTpCall` can be removed.
* `PyVectorcall_Function` (for getting the `vectorcallfunc` of an object) can be an internal helper. External code should go through `PyCall_Vectorcall` (whatever we name it).
* `PY_VECTORCALL_ARGUMENTS_OFFSET` is OK, bikeshedding over variants like `PY_VECTORCALL_PREPEND` won't bring much benefit.

Anyone against, make your point now :)

The following have discussion PRs open:

* `PyCall_MakeVectorCall` name: https://github.com/python/peps/pull/1037
* Passing a dict to `PyObject_Vectorcall`: https://github.com/python/peps/pull/1038
* Type of the kwnames argument (PyObject/PyTupleObject): https://github.com/python/peps/pull/1039

The remaining points are:

### Making things private

For Python 3.8, the public API should be private, so the API can get some contact with the real world. I'd especially like to be able to learn from Cython's experience using it. That would mean:

* _PyObject_Vectorcall
* _PyCall_MakeVectorCall
* _PyVectorcall_NARGS
* _METH_VECTORCALL
* _Py_TPFLAGS_HAVE_VECTORCALL
* _Py_TPFLAGS_METHOD_DESCRIPTOR

### Can the kwnames tuple be empty?

Disallowing empty tuples means it's easier for the *callee* to detect the case of no keyword arguments. Instead of:

    if (kwnames != NULL && PyTuple_GET_SIZE(kwnames))

you have:

    if (kwnames != NULL)

On the other hand, the *caller* would now be responsible for handling the no-kwarg case specially. Jeroen points out:
But, if you apply the robustness principle to vectorcallfunc, it should accept empty tuples.

### `METH_VECTORCALL` function type

Jeroen suggested changing this from:

    PyObject *(*call) (PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

to `vectorcallfunc`, i.e.:

    PyObject *(*call) (PyObject *callable, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

Mark argues that this is a major change and prevents the interpreter from sanity-checking the return value of PyMethodDef-defined functions. (Since the functions are defined by extension code, they need to be sanity-checked, and this will be done by PyCFunction's vectorcall adapter. Tools like Cython can bypass the check if needed.) The underlying C function should not need to know how to extract "self" from the function object, or how to handle the argument offsetting. Those should be implementation details. I see the value in having METH_VECTORCALL equivalent to the existing METH_FASTCALL|METH_KEYWORDS. (Even though PEP 573 will need to add to the calling convention.)

On Thu, May 9, 2019 at 11:31 AM Petr Viktorin <encukou@gmail.com> wrote:
Any reason the above are all "Vectorcall" and not "VectorCall"? You seem to potentially have that capitalization for "PyCall_MakeVectorCall" as mentioned below which seems to be asking for typos if there's going to be two ways to do it. :) -Brett

On 2019-05-09 23:09, Brett Cannon wrote:
"PyCall_MakeVectorCall" is a typo for "PyVectorcall_Call" (https://github.com/python/peps/pull/1037) Everything else uses "Vectorcall" or "VECTORCALL". In text, we use "vectorcall" without a space.

On 2019-05-09 20:30, Petr Viktorin wrote:
Do we really have to underscore the names? Would there be a way to mark this API as provisional and subject to change without changing the names? If it turns out that PEP 590 was perfect after all, then we're just breaking stuff in Python 3.9 (when removing the underscores) for no reason. Alternatively, could we keep the underscored names as official API in Python 3.9?

On 2019-05-09 20:30, Petr Viktorin wrote:
Maybe you misunderstood my proposal. I want to allow both for extra flexibility:

- METH_FASTCALL (possibly combined with METH_KEYWORDS) continues to work as before. If you don't want to care about the implementation details of vectorcall, this is the right thing to use.
- METH_VECTORCALL (using exactly the vectorcallfunc signature) is a new calling convention for applications that want the lowest possible overhead at the cost of being slightly harder to use.

Personally, I consider the discussion about who is supposed to check that a function returns NULL if and only if an error occurred a tiny detail which shouldn't dictate the design. There are two solutions for this: either we move that check one level up and do it for all vectorcall functions, or we keep the existing checks in place but don't do that check for METH_VECTORCALL (this is already more specialized anyway, so dropping that check doesn't hurt much). We could also decide to enable this check only for debug builds, especially if debug builds are going to be easier to use thanks to Victor Stinner's work.
I see the value in having METH_VECTORCALL equivalent to the existing METH_FASTCALL|METH_KEYWORDS.
But why invent a new name for that? METH_FASTCALL|METH_KEYWORDS already works. The alias METH_VECTORCALL could only make things more confusing (having two ways to specify exactly the same thing). Or am I missing something? Jeroen.

On 5/9/19 5:33 PM, Jeroen Demeyer wrote:
Then we can, in the spirit of minimalism, not add METH_VECTORCALL at all.
METH_FASTCALL is currently not documented, and it should be renamed before it's documented. Names with "fast" or "new" generally don't age well.

Petr Viktorin schrieb am 10.05.19 um 00:07:
I personally don't see an advantage in having both, apart from helping code that wants to be fast also on Py3.7, for example. It unnecessarily complicates the CPython implementation and C-API. I'd be ok with removing FASTCALL in favour of VECTORCALL. That's more code to generate for Cython in order to adapt to Py<3.6, Py3.6, Py3.7 and then Py>=3.[89], but well, seeing the heap of code that we *already* generate, it's not going to hurt our users much. It would, however, be (selfishly) helpful if FASTCALL could still go through a deprecation period, because we'd like to keep the current Cython 0.29.x release series compatible with Python 3.8, and I'd like to avoid adding support for VECTORCALL and compiling out FASTCALL in a point release. Removing it in Py3.9 seems ok to me. Stefan

On 2019-05-10 00:07, Petr Viktorin wrote:
Just to make sure that we're understanding correctly, is your proposal to do the following: - remove the name METH_FASTCALL - remove the calling convention METH_FASTCALL without METH_KEYWORDS - rename METH_FASTCALL|METH_KEYWORDS -> METH_VECTORCALL

On 2019-05-09 20:30, Petr Viktorin wrote:
But, if you apply the robustness principle to vectorcallfunc, it should accept empty tuples.
Sure, if the callee wants to accept empty tuples anyway, it can do that. That's the robustness principle. But us *forcing* the callee to accept empty tuples is certainly not. Basically my point is: with a little bit of effort in CPython we can make things simpler for all users of vectorcall. Why not do that? Seriously, what's the argument for *not* applying this change? Jeroen.

Hi Jeroen, On 25/04/2019 3:42 pm, Jeroen Demeyer wrote:
AFAICT, any limitations on subclassing exist solely to prevent tp_call and the PEP 580/590 function pointer being in conflict. This limitation is inherent and the same for both PEPs. Do you agree?

Let us consider a class C that sets the Py_TPFLAGS_HAVE_CCALL/Py_TPFLAGS_HAVE_VECTORCALL flag. It will set the function pointer in a new instance, C(), when the object is created. If we create a new class D:

    class D(C):
        def __call__(self, ...):
            ...

and then create an instance `d = D()`, then calling d will have two contradictory behaviours: the one installed by C in the function pointer and the one specified by D.__call__. We can ensure correct behaviour by setting the function pointer to NULL or a forwarding function (depending on the implementation) if __call__ has been overridden. This would be enforced at class creation/readying time. Cheers, Mark.

On 2019-04-27 14:07, Mark Shannon wrote:
It's true that the function pointer in D will be wrong but it's also irrelevant since the function pointer won't be used: class D won't have the flag Py_TPFLAGS_HAVE_CCALL/Py_TPFLAGS_HAVE_VECTORCALL set.
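As a pure-Python analogy of this resolution (hypothetical names; CPython would do the equivalent in C when the class is readied): a subclass that defines its own __call__ loses the fast-path flag at class-creation time, so its per-instance pointer is never consulted:

```python
class FastCallable:
    """Base class installing a per-instance fast-call function, a
    rough analogue of a type with Py_TPFLAGS_HAVE_VECTORCALL set.
    All names here are hypothetical."""
    _fastcall_ok = True  # analogue of the type flag

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Enforced at class-creation time: a subclass that overrides
        # __call__ must not inherit the fast path.
        if "__call__" in cls.__dict__:
            cls._fastcall_ok = False

    def __init__(self):
        self._fastcall = lambda: "fast path"   # per-instance pointer

    def __call__(self):
        return "tp_call path"

    def invoke(self):
        # Model of the interpreter's dispatch: trust the per-instance
        # pointer only while the class-level flag is set.
        if type(self)._fastcall_ok and self._fastcall is not None:
            return self._fastcall()
        return self()

class D(FastCallable):
    def __call__(self):
        return "D.__call__"
```

Here FastCallable().invoke() uses the fast pointer, while D().invoke() falls back to D.__call__, mirroring "class D won't have the flag set".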

Hi Petr, On 24/04/2019 11:24 pm, Petr Viktorin wrote:
A big problem with adding another field to the structure is that it prevents classes from implementing vectorcall. A 30% reduction in the time to create ranges, small lists and sets and to call type(x) is easily worth a single tp_flag, IMO. As an aside, there are currently over 10 spare flags. As long as we don't consume more than one a year, we have over a decade to make tp_flags a uint64_t. It already consumes 64 bits on any 64 bit machine, due to the struct layout. As I've said before, PEP 590 is universal and capable of supporting an implementation of PEP 580 on top of it. Therefore, adding any flags or fields from PEP 580 to PEP 590 will not increase its capability. Since any extra fields will require at least as many memory accesses as before, it will not improve performance, and by restricting layout may decrease it.
That would prevent the code having access to the callable object. That access is a fundamental part of both PEP 580 and PEP 590 and the key motivating factor for both.
As I see it, authors of C extensions have five options with PEP 590. Option 4, do nothing, is the recommended option :)

1. Use the PyMethodDef protocol; it will work exactly the same as before. It's already fairly quick in most cases.
2. Use Cython and let Cython take care of handling the vectorcall interface.
3. Use Argument Clinic, and let Argument Clinic take care of handling the vectorcall interface.
4. Do nothing. This is the same as 1-3 above, depending on what you were already doing.
5. Implement the vectorcall call directly. This might be a bit quicker than the above, but probably not enough to be worth it, unless you are implementing numpy or something like that.
Not just bound methods, any callable that adds an extra argument before dispatching to another callable. This includes builtin-methods, classes and a few others. Setting the Py_TPFLAGS_METHOD_DESCRIPTOR flag states the behaviour of the object when used as a descriptor. It is up to the implementation to use that information how it likes. If LOAD_METHOD/CALL_METHOD gets replaced, then the new implementation can still use this information.
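The two descriptor equivalences that Py_TPFLAGS_METHOD_DESCRIPTOR promises can be checked in pure Python, since plain functions already behave this way (this demonstrates the required semantics only, not the C flag itself):

```python
# Pure-Python check of the two equivalences required of a callable
# whose type sets Py_TPFLAGS_METHOD_DESCRIPTOR; plain Python
# functions already satisfy them, so a function serves as `func`.
def func(self, x):
    return (self, x)

class C:
    pass

obj = C()

# func.__get__(obj, cls)(*args) must behave like func(obj, *args):
assert func.__get__(obj, C)(42) == func(obj, 42)

# func.__get__(None, cls)(*args) must behave like func(*args);
# for plain functions, __get__(None, cls) simply returns func:
assert func.__get__(None, C)("a", "b") == func("a", "b")
```

This is what lets LOAD_METHOD/CALL_METHOD skip creating the bound-method object and call `func(obj, ...)` directly.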
This seems a lot more complex than the caller setting a bit to tell the callee whether it has allocated extra space.
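As a pure-Python sketch of that bit (illustrative names; list slicing stands in for C pointer arithmetic, and the flag value is an assumption): the caller sets PY_VECTORCALL_ARGUMENTS_OFFSET to say that the slot just before the arguments may be borrowed, and a bound-method-like callee prepends self there without allocating:

```python
# Pure-Python sketch of the PY_VECTORCALL_ARGUMENTS_OFFSET trick.
PY_VECTORCALL_ARGUMENTS_OFFSET = 1 << 62  # assumed reserved high bit

def vectorcall_nargs(n):
    # Model of PyVectorcall_NARGS: mask off the flag.
    return n & ~PY_VECTORCALL_ARGUMENTS_OFFSET

def bound_method_vectorcall(self_obj, target, vector, offset, n):
    """Model of a bound method's vectorcallfunc.  Arguments start at
    vector[offset]; if the flag is set in n, slot offset-1 belongs to
    the caller but may be temporarily borrowed."""
    nargs = vectorcall_nargs(n)
    if n & PY_VECTORCALL_ARGUMENTS_OFFSET:
        saved = vector[offset - 1]
        vector[offset - 1] = self_obj          # prepend self in place
        try:
            return target(vector[offset - 1:offset + nargs])
        finally:
            vector[offset - 1] = saved         # restore before returning
    # No spare slot granted: fall back to allocating a new vector.
    return target([self_obj] + list(vector[offset:offset + nargs]))
```

With the flag set, the caller's existing slot (for example the bytecode interpreter's stack slot holding the callable) is reused, so the onward call needs no allocation.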

Discussion on PEP 590 (Vectorcall) has been split over several PRs, issues and e-mails, so let me post an update. I am planning to approve PEP 590 with the following changes, if Mark doesn't object to them: * https://github.com/python/peps/pull/1064 (Mark the main API as private to allow changes in Python 3.9) * https://github.com/python/peps/pull/1066 (Use size_t for "number of arguments + flag") The resulting text, for reference: PEP: 590 Title: Vectorcall: a fast calling protocol for CPython Author: Mark Shannon <mark@hotpy.org>, Jeroen Demeyer <J.Demeyer@UGent.be> BDFL-Delegate: Petr Viktorin <encukou@gmail.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 29-Mar-2019 Python-Version: 3.8 Post-History: Abstract ======== This PEP introduces a new C API to optimize calls of objects. It introduces a new "vectorcall" protocol and calling convention. This is based on the "fastcall" convention, which is already used internally by CPython. The new features can be used by any user-defined extension class. Most of the new API is private in CPython 3.8. The plan is to finalize semantics and make it public in Python 3.9. **NOTE**: This PEP deals only with the Python/C API, it does not affect the Python language or standard library. Motivation ========== The choice of a calling convention impacts the performance and flexibility of code on either side of the call. Often there is tension between performance and flexibility. The current ``tp_call`` [2]_ calling convention is sufficiently flexible to cover all cases, but its performance is poor. The poor performance is largely a result of having to create intermediate tuples, and possibly intermediate dicts, during the call. This is mitigated in CPython by including special-case code to speed up calls to Python and builtin functions. Unfortunately, this means that other callables such as classes and third party extension objects are called using the slower, more general ``tp_call`` calling convention. 
This PEP proposes that the calling convention used internally for Python and builtin functions is generalized and published so that all calls can benefit from better performance. The new proposed calling convention is not fully general, but covers the large majority of calls. It is designed to remove the overhead of temporary object creation and multiple indirections. Another source of inefficiency in the ``tp_call`` convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class ``cls``, at least one intermediate object is created for each call in the sequence ``type.__call__``, ``cls.__new__``, ``cls.__init__``. This PEP proposes an interface for use by extension modules. Such interfaces cannot effectively be tested, or designed, without having the consumers in the loop. For that reason, we provide private (underscore-prefixed) names. The API may change (based on consumer feedback) in Python 3.9, where we expect it to be finalized, and the underscores removed. Specification ============= The function pointer type ------------------------- Calls are made through a function pointer taking the following parameters: * ``PyObject *callable``: The called object * ``PyObject *const *args``: A vector of arguments * ``size_t nargs``: The number of arguments plus the optional flag ``PY_VECTORCALL_ARGUMENTS_OFFSET`` (see below) * ``PyObject *kwnames``: Either ``NULL`` or a tuple with the names of the keyword arguments This is implemented by the function pointer type: ``typedef PyObject *(*vectorcallfunc)(PyObject *callable, PyObject *const *args, size_t nargs, PyObject *kwnames);`` Changes to the ``PyTypeObject`` struct -------------------------------------- The unused slot ``printfunc tp_print`` is replaced with ``tp_vectorcall_offset``. It has the type ``Py_ssize_t``. 
A new ``tp_flags`` flag is added, ``_Py_TPFLAGS_HAVE_VECTORCALL``, which must be set for any class that uses the vectorcall protocol. If ``_Py_TPFLAGS_HAVE_VECTORCALL`` is set, then ``tp_vectorcall_offset`` must be a positive integer. It is the offset into the object of the vectorcall function pointer of type ``vectorcallfunc``. This pointer may be ``NULL``, in which case the behavior is the same as if ``_Py_TPFLAGS_HAVE_VECTORCALL`` was not set. The ``tp_print`` slot is reused as the ``tp_vectorcall_offset`` slot to make it easier for external projects to backport the vectorcall protocol to earlier Python versions. In particular, the Cython project has shown interest in doing that (see https://mail.python.org/pipermail/python-dev/2018-June/153927.html). Descriptor behavior ------------------- One additional type flag is specified: ``Py_TPFLAGS_METHOD_DESCRIPTOR``. ``Py_TPFLAGS_METHOD_DESCRIPTOR`` should be set if the callable uses the descriptor protocol to create a bound method-like object. This is used by the interpreter to avoid creating temporary objects when calling methods (see ``_PyObject_GetMethod`` and the ``LOAD_METHOD``/``CALL_METHOD`` opcodes). Concretely, if ``Py_TPFLAGS_METHOD_DESCRIPTOR`` is set for ``type(func)``, then: - ``func.__get__(obj, cls)(*args, **kwds)`` (with ``obj`` not None) must be equivalent to ``func(obj, *args, **kwds)``. - ``func.__get__(None, cls)(*args, **kwds)`` must be equivalent to ``func(*args, **kwds)``. There are no restrictions on the object ``func.__get__(obj, cls)``. The latter is not required to implement the vectorcall protocol. The call -------- The call takes the form ``((vectorcallfunc)(((char *)o)+offset))(o, args, n, kwnames)`` where ``offset`` is ``Py_TYPE(o)->tp_vectorcall_offset``. The caller is responsible for creating the ``kwnames`` tuple and ensuring that there are no duplicates in it. ``n`` is the number of positional arguments plus possibly the ``PY_VECTORCALL_ARGUMENTS_OFFSET`` flag. 
PY_VECTORCALL_ARGUMENTS_OFFSET ------------------------------ The flag ``PY_VECTORCALL_ARGUMENTS_OFFSET`` should be added to ``n`` if the callee is allowed to temporarily change ``args[-1]``. In other words, this can be used if ``args`` points to argument 1 in the allocated vector. The callee must restore the value of ``args[-1]`` before returning. Whenever they can do so cheaply (without allocation), callers are encouraged to use ``PY_VECTORCALL_ARGUMENTS_OFFSET``. Doing so will allow callables such as bound methods to make their onward calls cheaply. The bytecode interpreter already allocates space on the stack for the callable, so it can use this trick at no additional cost. See [3]_ for an example of how ``PY_VECTORCALL_ARGUMENTS_OFFSET`` is used by a callee to avoid allocation. For getting the actual number of arguments from the parameter ``n``, the macro ``PyVectorcall_NARGS(n)`` must be used. This allows for future changes or extensions. New C API and changes to CPython ================================ The following functions or macros are added to the C API: - ``PyObject *_PyObject_Vectorcall(PyObject *obj, PyObject *const *args, size_t nargs, PyObject *keywords)``: Calls ``obj`` with the given arguments. Note that ``nargs`` may include the flag ``PY_VECTORCALL_ARGUMENTS_OFFSET``. The actual number of positional arguments is given by ``PyVectorcall_NARGS(nargs)``. The argument ``keywords`` is a tuple of keyword names or ``NULL``. An empty tuple has the same effect as passing ``NULL``. This uses either the vectorcall protocol or ``tp_call`` internally; if neither is supported, an exception is raised. - ``PyObject *PyVectorcall_Call(PyObject *obj, PyObject *tuple, PyObject *dict)``: Call the object (which must support vectorcall) with the old ``*args`` and ``**kwargs`` calling convention. This is mostly meant to put in the ``tp_call`` slot. 
- ``Py_ssize_t PyVectorcall_NARGS(size_t nargs)``: Given a vectorcall ``nargs`` argument, return the actual number of arguments. Currently equivalent to ``nargs & ~PY_VECTORCALL_ARGUMENTS_OFFSET``. Subclassing ----------- Extension types inherit the type flag ``_Py_TPFLAGS_HAVE_VECTORCALL`` and the value ``tp_vectorcall_offset`` from the base class, provided that they implement ``tp_call`` the same way as the base class. Additionally, the flag ``Py_TPFLAGS_METHOD_DESCRIPTOR`` is inherited if ``tp_descr_get`` is implemented the same way as the base class. Heap types never inherit the vectorcall protocol because that would not be safe (heap types can be changed dynamically). This restriction may be lifted in the future, but that would require special-casing ``__call__`` in ``type.__setattribute__``. Finalizing the API ================== The underscore in the names ``_PyObject_Vectorcall`` and ``_Py_TPFLAGS_HAVE_VECTORCALL`` indicates that this API may change in minor Python versions. When finalized (which is planned for Python 3.9), they will be renamed to ``PyObject_Vectorcall`` and ``Py_TPFLAGS_HAVE_VECTORCALL``. The old underscore-prefixed names will remain available as aliases. The new API will be documented as normal, but will warn of the above. Semantics for the other names introduced in this PEP (``PyVectorcall_NARGS``, ``PyVectorcall_Call``, ``Py_TPFLAGS_METHOD_DESCRIPTOR``, ``PY_VECTORCALL_ARGUMENTS_OFFSET``) are final. Internal CPython changes ======================== Changes to existing classes --------------------------- The ``function``, ``builtin_function_or_method``, ``method_descriptor``, ``method``, ``wrapper_descriptor``, ``method-wrapper`` classes will use the vectorcall protocol (not all of these will be changed in the initial implementation). For ``builtin_function_or_method`` and ``method_descriptor`` (which use the ``PyMethodDef`` data structure), one could implement a specific vectorcall wrapper for every existing calling convention. 
Whether or not it is worth doing that remains to be seen. Using the vectorcall protocol for classes ----------------------------------------- For a class ``cls``, creating a new instance using ``cls(xxx)`` requires multiple calls. At least one intermediate object is created for each call in the sequence ``type.__call__``, ``cls.__new__``, ``cls.__init__``. So it makes a lot of sense to use vectorcall for calling classes. This really means implementing the vectorcall protocol for ``type``. Some of the most commonly used classes will use this protocol, probably ``range``, ``list``, ``str``, and ``type``. The ``PyMethodDef`` protocol and Argument Clinic ------------------------------------------------ Argument Clinic [4]_ automatically generates wrapper functions around lower-level callables, providing safe unboxing of primitive types and other safety checks. Argument Clinic could be extended to generate wrapper objects conforming to the new ``vectorcall`` protocol. This will allow execution to flow from the caller to the Argument Clinic generated wrapper and thence to the hand-written code with only a single indirection. Third-party extension classes using vectorcall ============================================== To enable call performance on a par with Python functions and built-in functions, third-party callables should include a ``vectorcallfunc`` function pointer, set ``tp_vectorcall_offset`` to the correct value and add the ``_Py_TPFLAGS_HAVE_VECTORCALL`` flag. Any class that does this must implement the ``tp_call`` function and make sure its behaviour is consistent with the ``vectorcallfunc`` function. Setting ``tp_call`` to ``PyVectorcall_Call`` is sufficient. Performance implications of these changes ========================================= This PEP should not have much impact on the performance of existing code (neither in the positive nor the negative sense). It is mainly meant to allow efficient new code to be written, not to make existing code faster. 
Nevertheless, this PEP optimizes for ``METH_FASTCALL`` functions. Performance of functions using ``METH_VARARGS`` will become slightly worse. Stable ABI ========== Nothing from this PEP is added to the stable ABI (PEP 384). Alternative Suggestions ======================= bpo-29259 --------- PEP 590 is close to what was proposed in bpo-29259 [#bpo29259]_. The main difference is that this PEP stores the function pointer in the instance rather than in the class. This makes more sense for implementing functions in C, where every instance corresponds to a different C function. It also allows optimizing ``type.__call__``, which is not possible with bpo-29259. PEP 576 and PEP 580 ------------------- Both PEP 576 and PEP 580 are designed to enable 3rd party objects to be both expressive and performant (on a par with CPython objects). The purpose of this PEP is to provide a uniform way to call objects in the CPython ecosystem that is both expressive and as performant as possible. This PEP is broader in scope than PEP 576 and uses variable rather than fixed offset function-pointers. The underlying calling convention is similar. Because PEP 576 only allows a fixed offset for the function pointer, it would not allow the improvements to any objects with constraints on their layout. PEP 580 proposes a major change to the ``PyMethodDef`` protocol used to define builtin functions. This PEP provides a more general and simpler mechanism in the form of a new calling convention. This PEP also extends the ``PyMethodDef`` protocol, but merely to formalise existing conventions. Other rejected approaches ------------------------- A longer, 6 argument, form combining both the vector and optional tuple and dictionary arguments was considered. However, it was found that the code to convert between it and the old ``tp_call`` form was overly cumbersome and inefficient. Also, since only 4 arguments are passed in registers on x64 Windows, the two extra arguments would have non-negligible costs. 
Removing any special cases and making all calls use the ``tp_call`` form was also considered. However, unless a much more efficient way was found to create and destroy tuples, and to a lesser extent dictionaries, then it would be too slow. Acknowledgements ================ Victor Stinner for developing the original "fastcall" calling convention internally to CPython. This PEP codifies and extends his work. References ========== .. [#bpo29259] Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects, https://bugs.python.org/issue29259 .. [2] tp_call/PyObject_Call calling convention https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_call .. [3] Using PY_VECTORCALL_ARGUMENTS_OFFSET in callee https://github.com/markshannon/cpython/blob/vectorcall-minimal/Objects/class... .. [4] Argument Clinic https://docs.python.org/3/howto/clinic.html Reference implementation ======================== A minimal implementation can be found at https://github.com/markshannon/cpython/tree/vectorcall-minimal Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End:

On 3/24/2019 8:21 AM, Nick Coghlan wrote:
Where do we discuss these? If a delegate has a provisional view, it might help focus discussion if that were known.
* PEP 499: Binding "-m" executed modules under their module name as well as `__main__`
My brief response: +1 unless there is a good reason not to. There have been multiple double-module problems reported on python-list and likely stackoverflow. And would there be any impact on circular imports? -- Terry Jan Reedy

On 24Mar2019 17:02, Terry Reedy <tjreedy@udel.edu> wrote:
There turn out to be some subtle side effects. The test suite turned up one (easily fixed) in pdb, but there are definitely some more things to investigate. Nick has pointed out pickle and the "python -i" option. I'm digging into these. (Naturally, I have _never_ before used the pdb or pickle modules, or the -i option :-)
Well, by binding the -m module to both __main__ and its name as denoted on the command line one circular import is directly short circuited. Aside from the -m module itself, I don't think there should be any other direct effect on circular imports. Did you have a specific scenario in mind? Cheers, Cameron Simpson <cs@cskk.id.au>

On 3/24/2019 7:00 PM, Cameron Simpson wrote:
I was thinking about IDLE and its tangled web of circular imports, but I am now convinced that this change will not affect it. Indeed, idlelib/pyshell.py already implements the idea of the proposal, ending with

    if __name__ == "__main__":
        sys.modules['pyshell'] = sys.modules['__main__']
        main()

(It turns out that this fails for other reasons, which I am looking into. The current recommendation is to start IDLE by running any of __main__.py (via python -m idlelib), idle.py, idlew.py, or idle.bat.) -- Terry Jan Reedy

On 3/24/2019 10:01 PM, Terry Reedy wrote:
On 3/24/2019 7:00 PM, Cameron Simpson wrote:
After more investigation, I realized that to stop having a duplicate module: 1. The alias should be 'idlelib.pyshell', not 'pyshell', at least when imports are all absolute. 2. It should be done at the top of the file, before the import of modules that import pyshell. If I run python f:/dev/3x/lib/idlelib/pyshell.py, the PEP patch would have to notice that pyshell is a module within idlelib and alias '__main__' to 'idlelib.pyshell', not 'pyshell'. Would the same be true if within-package imports were all relative?
(It turns out that this fails for other reasons, which I am looking into.
Since starting IDLE with pyshell once worked in the past, it appears to be because the startup command for run.py was outdated. Will fix. -- Terry Jan Reedy

On 24Mar2019 23:22, Terry Reedy <tjreedy@udel.edu> wrote:
The PEP499 patch effectively uses __main__.__spec__.name for the name of the alias. Does that simplify your issue? The current PR is here if you want to look at it: https://github.com/python/cpython/pull/12490
2. It should be done at the top of the file, before the import of modules that import pyshell.
Hmm, if PEP499 comes in you shouldn't need to do this at all. If PEP499 gets delayed or rejected I guess you're supporting this without it. Yes, you'll want to do it before any other imports happen (well, as you say, before any which import pyshell). What about (untested):

    if __name__ == '__main__':
        if __spec__.name not in sys.modules:
            sys.modules[__spec__.name] = sys.modules['__main__']

as a forward compatible setup?
I think so because we're using .__spec__.name, which I thought was post import name resolution. Testing in my PEP499 branch:

Test 1:

    [~/src/cpython-cs@github(git:PEP499-cs)]fleet*> ./python.exe -i Lib/idlelib/pyshell.py
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'run'
    >>> sys.modules['__main__']
    <module '__main__' (<_frozen_importlib_external.SourceFileLoader object at 0x1088e6040>)>
    >>> sys.modules['pyshell']
    <module '__main__' (<_frozen_importlib_external.SourceFileLoader object at 0x1088e6040>)>
    >>> sys.modules['idlelib.pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>

So pyshell and idlelib.pyshell are distinct here. __main__ and pyshell are the same module, courtesy of your sys.modules assignment at the bottom of pyshell.py. Test 3 below will be with that commented out.

Test 2:

    [~/src/cpython-cs@github(git:PEP499-cs)]fleet*> PYTHONPATH=$PWD/Lib ./python.exe -i -m idlelib.pyshell
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'run'
    >>> sys.modules['__main__']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> sys.modules['pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> sys.modules['idlelib.pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> id(sys.modules['__main__'])
    4551072712
    >>> id(sys.modules['pyshell'])
    4551072712
    >>> id(sys.modules['idlelib.pyshell'])
    4551072712

So this has __main__ and idlelib.pyshell the same module from the PEP499 patch and pyshell also the same from your sys.modules assignment. 
Test 3, with the pyshell.py sys.modules assignment commented out:

    [~/src/cpython-cs@github(git:PEP499-cs)]fleet*> PYTHONPATH=$PWD/Lib ./python.exe -i -m idlelib.pyshell
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'run'
    >>> sys.modules['__main__']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> sys.modules['pyshell']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'pyshell'
    >>> sys.modules['idlelib.pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> id(sys.modules['__main__'])
    4552379336
    >>> id(sys.modules['idlelib.pyshell'])
    4552379336

Here we've got __main__ and idlelib.pyshell the same module and no 'pyshell' in sys.modules. I don't think I understand your "relative import" scenario. Cheers, Cameron Simpson <cs@cskk.id.au>

On 3/25/2019 12:27 AM, Cameron Simpson wrote:
The new test passes on Win10.
When I start pyshell in my master repository directory on Windows with python -m idlelib.pyshell, __spec__.name is 'idlelib.pyshell', which I currently hard-coded. When I start with what should be equivalent python f:/dev/3x/lib/idlelib/pyshell.py, __spec__ is None and __spec__.name an attribute error.
You must be doing something different when __spec__ is None ;-). I tested the patch and it does not raise AttributeError with the command above.
This is because of an obsolete 'command = ...' around 420. The if line is correct always and the if/then not needed.
I verified that the module was being executed twice by putting print('running') at the top. __main__ and pyshell
are the same module, courtesy of your sys.modules assignment at the bottom of pyshell.py.
Obsolete and removed. Test 3 below will be with that commented out.
I don't think I understand your "relative import" scenario.
If files other than pyshell used relative 'import ./pyshell' instead of absolute 'import idlelib.pyshell', would the sys.modules key still be 'idlelib.pyshell' or 'pyshell'? Which is to ask, would the alias needed to avoid a second pyshell module still be 'idlelib.pyshell' or 'pyshell'?

On 25Mar2019 03:52, Terry Reedy <tjreedy@udel.edu> wrote:
Um, yes. I presume that since no "import" has been done, there's no import spec (.__spec__). Clearly the above needs to accommodate this, possibly with a fallback guess. Is sniffing the end components of __file__ at all sane? Ending in idlelib/pyshell.py or pyshell.py? Or is that just getting baroque? I don't think these are strictly the same from some kind of purist viewpoint: the path might be anything - _is_ it reasonable to suppose that it has a module name (== importable/findable through the import path)?
Indeed. I may have fudged a bit when I said "The PEP499 patch effectively uses __main__.__spec__.name". It modifies runpy.py's _run_module_as_main function, and that is called for the "python -m module_name" invocation, so it can get the module spec because it has a module name. So the patch doesn't have to cope with __spec__ being None. As you say, __spec__ is None for "python path/to/file.py" so __spec__ isn't any use there. Apologies. [...]
Ok. As I understand it Python 3 imports are absolute: without a leading dot a name is absolute, so "import pyshell" should install sys.modules['pyshell'] _provided_ that 'pyshell' can be found in the module search path. Conversely, an "import .pyshell" is an import relative to the current module's package name, equivalent to an import of the absolute path "package.name.pyshell", for whatever the package name is. So (a) you can only import '.pyshell' from within a package containing a 'pyshell.py' file and (b) you can't import '.pyshell' if you're not in a package. I stuffed a "test2.py" into the local idlelib like this:

    import sys
    print("running", __file__, __name__)
    print(repr(sorted(sys.modules)))
    print(repr(sys.path))
    from pyshell import idle_showwarning
    print(repr(sorted(sys.modules)))

and fiddled with the "from pyshell import idle_showwarning" line. (I'm presuming this is what you have in mind, since "import ./pyshell" elicits a syntax error.) Using "./python.exe -m idlelib.test2":

Plain "pyshell" gets an ImportError - no such module. Using ".pyshell" imports the pyshell module as "idlelib.pyshell" in sys.modules.

Which was encouraging until I went "./python.exe Lib/idlelib/test2.py". This puts Lib/idlelib (as an absolute path) at the start of sys.path. A plain "pyshell" import works and installs sys.modules['pyshell']. Conversely, trying the ".pyshell" import gets:

    ModuleNotFoundError: No module named '__main__.pyshell'; '__main__' is not a package

So we can get 'pyshell' or 'idlelib.pyshell' into sys.modules depending how we invoke python. HOWEVER, if you're importing the 'pyshell' from idlelib _as found in the module search path_, whether absolutely as 'idlelib.pyshell' or relatively as '.pyshell' from within the idlelib package, you should always get 'idlelib.pyshell' in the sys.modules map. 
And I don't think you should need to worry about a circular import importing some top level name 'pyshell' because that's not using the idlelib package, so I'd argue it isn't your problem. Thoughts? Cheers, Cameron Simpson <cs@cskk.id.au>
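[Editorial note: the invocation-dependent behaviour described above can be reproduced with a small throwaway package. The "demopkg" name and layout below are made up for illustration; this is a sketch, not part of the original thread.]

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a minimal package: demopkg/__init__.py and demopkg/mod.py
    pkg = os.path.join(tmp, "demopkg")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "mod.py"), "w") as f:
        f.write('print(__name__, repr(__package__))\n')

    # "python -m demopkg.mod": runpy sets __package__ to "demopkg",
    # so relative imports inside mod.py would resolve against it.
    out_m = subprocess.run(
        [sys.executable, "-m", "demopkg.mod"],
        cwd=tmp, capture_output=True, text=True).stdout
    print(out_m.strip())        # __main__ 'demopkg'

    # Direct execution: __main__ has no package, so a ".pyshell"-style
    # relative import would fail with "'__main__' is not a package".
    out_direct = subprocess.run(
        [sys.executable, os.path.join(pkg, "mod.py")],
        capture_output=True, text=True).stdout
    print(out_direct.strip())   # no package name here
```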

On Mon, 25 Mar 2019 at 20:34, Cameron Simpson <cs@cskk.id.au> wrote:
Directly executing files from inside Python packages is explicitly unsupported, and nigh guaranteed to result in a broken import setup, as relative imports won't work, and absolute imports will most likely result in a second copy of the script module getting loaded. The problem is that __main__ always thinks it is a top-level module for directly executed scripts - it needs the package structure information from the "-m" switch to learn otherwise. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia


On 2019-03-24 16:22, Mark Shannon wrote:
Thanks for that. Is this new PEP meant to supersede PEP 576?
I'd like to have a testable branch, before formally submitting the PEP, but I'd thought you should be aware of the PEP.
If you want to bring up this PEP now during the PEP 576/580 discussion, maybe it's best to formally submit it now? Having an official PEP number might simplify the discussion. If it turns out to be a bad idea after all, you can still withdraw it. In the mean time, I remind you that PEP 576 also doesn't have a complete reference implementation (the PEP links to a "reference implementation" but it doesn't correspond to the text of the PEP). Jeroen.

On 2019-03-24 16:22, Mark Shannon wrote:
The draft can be found here: https://github.com/markshannon/peps/blob/new-calling-convention/pep-9999.rst
I think that this is basically a better version of PEP 576. The idea is the same as PEP 576, but the details are better. Since it's not fundamentally different from PEP 576, I think that this comparison still stands: https://mail.python.org/pipermail/python-dev/2018-July/154238.html

On Sun, Mar 24, 2019 at 4:22 PM Mark Shannon <mark@hotpy.org> wrote:
Hello Mark, Thank you for letting me know! I wish I knew of this back in January, when you committed the first draft. This is unfair to the competing PEP, which is ready and was waiting for the new governance. We have lost three months that could have been spent pondering the ideas in the pre-PEP. Do you think you will find the time to piece things together? Is there anything that you already know should be changed? Do you have any comments on [Jeroen's comparison]? The pre-PEP is simpler than PEP 580, because it solves simpler issues. I'll need to confirm that it won't paint us into a corner -- that there's a way to address all the issues in PEP 579 in the future. The pre-PEP claims speedups of 2% in initial experiments, with an expected overall performance gain of 4% for the standard benchmark suite. That's pretty big. As far as I can see, PEP 580 claims not much improvement in CPython, but rather large improvements for extensions (Mistune with Cython). The pre-PEP has a complication around offsetting arguments by 1 to allow bound methods to forward calls cheaply. I fear that this optimizes for current usage with its limitations. PEP 580's cc_parent allows bound methods to have access to the class, and through that, the module object where they are defined and the corresponding module state. To support this, vector calls would need a two-argument offset. (That seems to illustrate the main difference between the motivations of the two PEPs: one focuses on extensibility; the other on optimizing existing use cases.) The pre-PEP's "any third-party class implementing the new call interface will not be usable as a base class" looks quite limiting. [Jeroen's comparison]: https://mail.python.org/pipermail/python-dev/2018-July/154238.html

For lack of a better name, I'm using the name PEP 576bis to refer to https://github.com/markshannon/peps/blob/new-calling-convention/pep-9999.rst (This is why this should get a PEP number soon, even if the PEP is not completely done yet). On 2019-03-27 14:50, Petr Viktorin wrote:
One potential issue is calling bound methods (in the duck typing sense) when the LOAD_METHOD optimization is *not* used. This would happen for example when storing a bound method object somewhere and then calling it (possibly repeatedly). Perhaps that's not a very common thing and we should just live with that. However, since __self__ is part of the PEP 580 protocol, it allows calling a bound method object without any performance penalty compared to calling the underlying function directly. Similarly, a follow-up of PEP 580 could allow zero-overhead calling of static/class methods (I didn't put this in PEP 580 because it's already too long).
As far as I can see, PEP 580 claims not much improvement in CPython, but rather large improvements for extensions (Mistune with Cython).
Cython is indeed the main reason for PEP 580.
The pre-PEP has a complication around offsetting arguments by 1 to allow bound methods to forward calls cheaply.
I honestly don't understand what this "offset by one" means or why it's useful. It should be better explained in the PEP.
The pre-PEP's "any third-party class implementing the new call interface will not be usable as a base class" looks quite limiting.
I agree, this is pretty bad. However, I don't think that there is a need for this limitation. PEP 580 solves this by only inheriting the Py_TPFLAGS_HAVE_CCALL flag in specific cases. PEP 576bis could do something similar. Finally, I don't agree with this sentence from PEP 576bis: PEP 580 is specifically targetted at function-like objects, and doesn't support other callables like classes, partial functions, or proxies. It's true that classes are not supported (and I wonder how PEP 576bis deals with that, it would be good to explain that more explicitly) but other callables are not a problem. Jeroen.

On 2019-03-27 14:50, Petr Viktorin wrote:
I re-did my earlier benchmarks for PEP 580 and these are the results: https://gist.github.com/jdemeyer/f0d63be8f30dc34cc989cd11d43df248 In general, the PEP 580 timings seem slightly better than vanilla CPython, similar to what Mark got. I'm speculating that the speedup in both cases comes from the removal of type checks and dispatching depending on that, and instead using a single protocol that directly does what needs to be done. Jeroen.

Hi Petr, On 27/03/2019 1:50 pm, Petr Viktorin wrote:
I realize this is less than ideal. I had planned to publish this in December, but life intervened. Nothing bad, just too busy.
Do you think you will find the time to piece things together? Is there anything that you already know should be changed?
I've submitted the final PEP and minimal implementation https://github.com/python/peps/pull/960 https://github.com/python/cpython/compare/master...markshannon:vectorcall-mi...
Do you have any comments on [Jeroen's comparison]?
It is rather out of date, but two comments. 1. `_PyObject_FastCallKeywords()` is used as an example of a call in CPython. It is an internal implementation detail and not a common path. 2. The claim that PEP 580 allows "certain optimizations because other code can make assumptions" is flawed. In general, the caller cannot make assumptions about the callee or vice-versa. Python is a dynamic language.
The pre-PEP is simpler than PEP 580, because it solves simpler issues.
The fundamental issue being addressed is the same, and it is this: Currently third-party C code can either be called quickly or have access to the callable object, not both. Both PEPs address this.
I'll need to confirm that it won't paint us into a corner -- that there's a way to address all the issues in PEP 579 in the future.
PEP 579 is mainly a list of supposed flaws with the 'builtin_function_or_method' class. The general thrust of PEP 579 seems to be that builtin-functions and builtin-methods should be more flexible and extensible than they are. I don't agree. If you want different behaviour, then use a different object. Don't try to cram all this extra behaviour into a pre-existing object. However, if we assume that we are talking about callables implemented in C, in general, then there are 3 key issues covered by PEP 579:
1. Inspection and documentation; it is hard for extensions to have docstrings and signatures. Worth addressing, but completely orthogonal to PEP 590.
2. Extensibility and performance; extensions should have the power of Python functions without suffering slow calls. Allowing the C code access to the callable object is a general solution to this problem. Both PEP 580 and PEP 590 do this.
3. Exposing the underlying implementation and signature of the C code, so that optimisers can avoid unnecessary boxing. This may be worth doing, but until we have an adaptive optimiser capable of exploiting this information, this is premature. Neither PEP 580 nor PEP 590 explicitly allows or prevents this.
That's because there is a lot of code around calls in CPython, and it has grown in a rather haphazard fashion. Victor's work to add the "FASTCALL" protocol has helped. PEP 590 seeks to formalise and extend that, so that it can be used more consistently and efficiently.
As far as I can see, PEP 580 claims not much improvement in CPython, but rather large improvements for extensions (Mistune with Cython).
Calls to and from extension code are slow because they have to use the `tp_call` calling convention (or lose access to the callable object). With a calling convention that does not have any special cases, extensions can be as fast as builtin functions. Both PEP 580 and PEP 590 attempt to do this, but PEP 590 is more efficient.
It's optimising for the common case, while allowing the less common. Bound methods and classes need to add one additional argument. Other rarer cases, like `partial` may need to allocate memory, but can still add or remove any number of arguments.
Not true. The first argument in the vector call is the callable itself. Through that, any callable can access its class, its module or any other object it wants.
I'll reiterate that PEP 590 is more general than PEP 580 and that once the callable's code has access to the callable object (as both PEPs allow) then anything is possible. You can't get more extensible than that.
The pre-PEP's "any third-party class implementing the new call interface will not be usable as a base class" looks quite limiting.
PEP 580 has the same limitation for the same reasons. The limitation is necessary for correctness if an object supports calls via `__call__` and through another calling convention.
[Jeroen's comparison]: https://mail.python.org/pipermail/python-dev/2018-July/154238.html
Cheers, Mark.

On 2019-03-30 17:30, Mark Shannon wrote:
PEP 580 is meant for extension classes, not Python classes. Extension classes are not dynamic. When you implement tp_call in a given way, the user cannot change it. So if a class implements the C call protocol or the vectorcall protocol, callers can make assumptions about what that means.
I think that there is a misunderstanding here. I fully agree with the "use a different object" solution. This isn't a new solution: it's already possible to implement those different objects (Cython does it). It's just that this solution comes at a performance cost and that's what we want to avoid.
I would argue the opposite: PEP 590 defines a fixed protocol that is not easy to extend. PEP 580 on the other hand uses a new data structure PyCCallDef which could easily be extended in the future (this will intentionally never be part of the stable ABI, so we can do that). I have also argued before that the generality of PEP 590 is a bad thing rather than a good thing: by defining a more rigid protocol as in PEP 580, more optimizations are possible.
I don't think that this limitation is needed in either PEP. As I explained at the top of this email, it can easily be solved by not using the protocol for Python classes. What is wrong with my proposal in PEP 580: https://www.python.org/dev/peps/pep-0580/#inheritance Jeroen.

I added benchmarks for PEP 590: https://gist.github.com/jdemeyer/f0d63be8f30dc34cc989cd11d43df248

Hi, On 01/04/2019 6:31 am, Jeroen Demeyer wrote:
I added benchmarks for PEP 590:
https://gist.github.com/jdemeyer/f0d63be8f30dc34cc989cd11d43df248
Thanks. As expected, for calls to C functions, both PEPs and master perform about the same, as they are using almost the same calling convention under the hood. As an example of the advantage that a general fast calling convention gives you, I have implemented the vectorcall versions of list() and range() https://github.com/markshannon/cpython/compare/vectorcall-minimal...markshan... which gives a roughly 30% reduction in time for creating ranges, or lists from small tuples. https://gist.github.com/markshannon/5cef3a74369391f6ef937d52cca9bfc8 Cheers, Mark.

On 2019-04-02 21:38, Mark Shannon wrote:
While they are "about the same", in general PEP 580 is slightly faster than master and PEP 590. And PEP 590 actually has a minor slow-down for METH_VARARGS calls. I think this happens because PEP 580 has fewer levels of indirection than PEP 590. The vectorcall protocol (PEP 590) replaces a slower level (tp_call) with a faster level (vectorcall), while PEP 580 just removes that level entirely: it calls the C function directly. This shows that PEP 580 is really meant to have maximal performance in all cases, accidentally even making existing code faster. Jeroen.

On 3/30/19 11:36 PM, Jeroen Demeyer wrote:
It does seem like there is some misunderstanding. PEP 580 defines a CCall structure, which includes the function pointer, flags, "self" and "parent". Like the current implementation, it has various METH_ flags for various C signatures. When called, the info from CCall is matched up (in relatively complex ways) to what the C function expects. PEP 590 only adds the "vectorcall". It does away with flags and only has one C signature, which is designed to fit all the existing ones, and is well optimized. Storing the "self"/"parent", and making sure they're passed to the C function, is the responsibility of the callable object. There's an optimization for "self" (offsetting using PY_VECTORCALL_ARGUMENTS_OFFSET), and any supporting info can be provided as part of "self".
Anything is possible, but if one of the possibilities becomes common and useful, PEP 590 would make it hard to optimize for it. Python has grown many "METH_*" signatures over the years as we found more things that need to be passed to callables. Why would "METH_VECTORCALL" be the last? If it won't (if you think about it as one more way to call functions), then dedicating a tp_* slot to it sounds quite expensive. In one of the ways to call C functions in PEP 580, the function gets access to:
- the arguments,
- "self", the object,
- the class that the method was found in (which is not necessarily type(self))
I still have to read the details, but when combined with the LOAD_METHOD/CALL_METHOD optimization (avoiding creation of a "bound method" object), it seems impossible to do this efficiently with just the callable's code and callable's object.
I'll add Jeroen's notes from the review of the proposed PEP 590 (https://github.com/python/peps/pull/960): The statement "PEP 580 is specifically targetted at function-like objects, and doesn't support other callables like classes, partial functions, or proxies" is factually false. The motivation for PEP 580 is certainly function/method-like objects, but it's a general protocol that every class can implement. For certain classes, it may not be easy or desirable to do that, but it's always possible. Given that `PY_METHOD_DESCRIPTOR` is a flag for tp_flags, shouldn't it be called `Py_TPFLAGS_METHOD_DESCRIPTOR` or something? Py_TPFLAGS_HAVE_VECTOR_CALL should be Py_TPFLAGS_HAVE_VECTORCALL, to be consistent with tp_vectorcall_offset and other uses of "vectorcall" (not "vector call"). And mine, so far: I'm not clear on the constness of the "args" array. If it is mutable (PyObject **), you can't, for example, directly pass a tuple's storage (or any other array that could be used in the call). If it is not (PyObject * const *), you can't insert the "self" argument in. The reference implementation seems to be inconsistent here. What's the intention?

Hi, On 02/04/2019 1:49 pm, Petr Viktorin wrote:
I doubt METH_VECTORCALL will be the last. Let me give you an example: It is quite common for a function to take two arguments, so we might want to add a METH_OO flag for builtin-functions with 2 parameters. To support this in PEP 590, you would make exactly the same change as you would now; which is to add another case to the switch statement in _PyCFunction_FastCallKeywords. For PEP 580, you would add another case to the switch in PyCCall_FastCall. No difference really. PEP 580 uses a slot as well. It's only 8 bytes per class.
It is possible, and relatively straightforward. Why do you think it is impossible?
Thanks for the comments, I'll update the PEP when I get the chance.
I'll make it clearer in the PEP. My thinking was that if `PY_VECTORCALL_ARGUMENTS_OFFSET` is set then the caller is allowing the callee to mutate element -1. It would make sense to generalise that to any element of the vector (including -1). When passing the contents of a tuple, `PY_VECTORCALL_ARGUMENTS_OFFSET` should not be set, and thus the vector could not be mutated. Cheers, Mark.

Access to the class isn't possible currently and also not with PEP 590. But it's easy enough to fix that: PEP 573 adds a new METH_METHOD flag to change the signature of the C function (not the vectorcall wrapper). PEP 580 supports this "out of the box" because I'm reusing the class also to do type checks. But this shouldn't be an argument for or against either PEP.

As I'm reading the PEP 590 reference implementation, it strikes me how similar it is to https://bugs.python.org/issue29259 The main difference is that bpo-29259 has a per-class pointer tp_fastcall instead of a per-object pointer. But actually, the PEP 590 reference implementation does not make much use of the per-object pointer: for all classes except "type", the vectorcall wrapper is the same for all objects of a given type. One thing that bpo-29259 did not realize is that existing optimizations could be dropped in favor of using tp_fastcall. For example, bpo-29259 has code like

  if (PyFunction_Check(callable)) {
      return _PyFunction_FastCallKeywords(...);
  }
  if (PyCFunction_Check(callable)) {
      return _PyCFunction_FastCallKeywords(...);
  }
  else if (PyType_HasFeature(..., Py_TPFLAGS_HAVE_FASTCALL) ...)

but the first 2 branches are superfluous given the third. Anyway, this is just putting PEP 590 a bit in perspective. It doesn't say anything about the merits of PEP 590. Jeroen.

On 2019-04-03 07:33, Jeroen Demeyer wrote:
Actually, in the answer above I only considered "is implementing PEP 573 possible?" but I did not consider the complexity of doing that. And in line with what I claimed about complexity before, I think that PEP 580 scores better in this regard. Take PEP 580 and assume for the sake of argument that it didn't already have the cc_parent field. Then adding support for PEP 573 is easy: just add the cc_parent field to the C call protocol structure and set that field when initializing a method_descriptor. C functions can use the METH_DEFARG flag to get access to the PyCCallDef structure, which gives cc_parent. Implementing PEP 573 for a custom function class takes no extra effort: it doesn't require any changes to that class, except for correctly initializing the cc_parent field. Since PEP 580 has built-in support for methods, nothing special needs to be done to support methods too. With PEP 590 on the other hand, every single class which is involved in PEP 573 must be changed and every single vectorcall wrapper supporting PEP 573 must be changed. This is not limited to the function class itself, also the corresponding method class (for example, builtin_function_or_method for method_descriptor) needs to be changed. Jeroen

Hello! I've had time for a more thorough reading of PEP 590 and the reference implementation. Thank you for the work! Overall, I like PEP 590's direction. I'd now describe the fundamental difference between PEP 580 and PEP 590 as:
- PEP 580 tries to optimize all existing calling conventions
- PEP 590 tries to optimize (and expose) the most general calling convention (i.e. fastcall)
PEP 580 also does a number of other things, as listed in PEP 579. But I think PEP 590 does not block future PEPs for the other items. On the other hand, PEP 580 has a much more mature implementation -- and that's where it picked up real-world complexity. PEP 590's METH_VECTORCALL is designed to handle all existing use cases, rather than mirroring the existing METH_* varieties. But both PEPs require the callable's code to be modified, so requiring it to switch calling conventions shouldn't be a problem. Jeroen's analysis from https://mail.python.org/pipermail/python-dev/2018-July/154238.html seems to miss a step at the top:
a. CALL_FUNCTION* / CALL_METHOD opcode, which calls
b. _PyObject_FastCallKeywords(), which calls
c. _PyCFunction_FastCallKeywords(), which calls
d. _PyMethodDef_RawFastCallKeywords(), which calls
e. the actual C function (*ml_meth)()
I think it's more useful to say that both PEPs bridge a->e (via _Py_VectorCall or PyCCall_Call). PEP 590 is built on a simple idea, formalizing fastcall. But it is complicated by PY_VECTORCALL_ARGUMENTS_OFFSET and Py_TPFLAGS_METHOD_DESCRIPTOR. As far as I understand, both are there to avoid an intermediate bound-method object for LOAD_METHOD/CALL_METHOD. (They do try to be general, but I don't see any other use case.) Is that right? (I'm running out of time today, but I'll write more on why I'm asking, and on the case I called "impossible" (while avoiding creation of a "bound method" object), later.) The way `const` is handled in the function signatures strikes me as too fragile for public API. 
I'd like it if, as much as possible, PY_VECTORCALL_ARGUMENTS_OFFSET was treated as a special optimization that extension authors can either opt in to, or blissfully ignore. That might mean:
- vectorcall, PyObject_VectorCallWithCallable, PyObject_VectorCall, PyCall_MakeTpCall all formally take "PyObject *const *args"
- a naïve callee must do "nargs &= ~PY_VECTORCALL_ARGUMENTS_OFFSET" (maybe spelled as "nargs &= PY_VECTORCALL_NARGS_MASK"), but otherwise writes compiler-enforced const-correct code.
- if PY_VECTORCALL_ARGUMENTS_OFFSET is set, the callee may modify "args[-1]" (and only that, and only after the author has read the docs).
Another point I'd like some discussion on is that the vectorcall function pointer is per-instance. It looks like this is only useful for type objects, but it will add a pointer to every new-style callable object (including functions). That seems wasteful. Why not have a per-type pointer, and for types that need it (like PyTypeObject), make it dispatch to an instance-specific function? Minor things:
- "Continued prohibition of callable classes as base classes" -- this section reads as final. Would you be OK wording this as something other PEPs can tackle?
- "PyObject_VectorCall" -- this looks extraneous, and the reference implementation doesn't need it so far. Can it be removed, or justified?
- METH_VECTORCALL is *not* strictly "equivalent to the currently undocumented METH_FASTCALL | METH_KEYWORD flags" (it has the ARGUMENTS_OFFSET complication).
- I'd like to officially call this PEP "Vectorcall", see https://github.com/python/peps/pull/984
Mark, what are your plans for next steps with PEP 590? If a volunteer wanted to help you push this forward, what would be the best thing to work on? Jeroen, is there something in PEPs 579/580 that PEP 590 blocks, or should address?

On 2019-04-10 18:25, Petr Viktorin wrote:
And thank you for the review!
And PEP 580 has better performance overall, even for METH_FASTCALL. See this thread: https://mail.python.org/pipermail/python-dev/2019-April/156954.html Since these PEPs are all about performance, I consider this a very relevant argument in favor of PEP 580.
I claim that the complexity in the protocol of PEP 580 is a good thing, as it removes complexity from other places, in particular from the users of the protocol (better have a complex protocol that's simple to use, rather than a simple protocol that's complex to use). As a more concrete example of the simplicity that PEP 580 could bring, CPython currently has 2 classes for bound methods implemented in C:
- "builtin_function_or_method" for normal C methods
- "method-descriptor" for slot wrappers like __eq__ or __add__
With PEP 590, these classes would need to stay separate to get maximal performance. With PEP 580, just one class for bound methods would be sufficient and there wouldn't be any performance loss. And this extends to custom third-party function/method classes, for example as implemented by Cython.
Agreed.
Not quite. For a builtin_function_or_method, we have with PEP 580:
a. call_function() calls
d. PyCCall_FastCall, which calls
e. the actual C function
and with PEP 590 it's more like:
a. call_function() calls
c. _PyCFunction_FastCallKeywords, which calls
d. _PyMethodDef_RawFastCallKeywords, which calls
e. the actual C function
Level c. above is the vectorcall wrapper, which is a level that PEP 580 doesn't have.
The way `const` is handled in the function signatures strikes me as too fragile for public API.
That's a detail which shouldn't influence the acceptance of either PEP.
Why not have a per-type pointer, and for types that need it (like PyTypeObject), make it dispatch to an instance-specific function?
That would be exactly https://bugs.python.org/issue29259 I'll let Mark comment on this.
Those are indeed details which shouldn't influence the acceptance of either PEP. If you go with PEP 590, then we should discuss this further.
Personally, I think what we need now is a decision between PEP 580 and PEP 590 (there is still the possibility of rejecting both, but I really hope that won't happen). There is a lot of work that still needs to be done after either PEP is accepted, such as:
- finish and merge the reference implementation
- document everything
- use the protocol in more classes where it makes sense (for example, staticmethod, wrapper_descriptor)
- use this in Cython
- handle more issues from PEP 579
I volunteer to put my time into this, regardless of which PEP is accepted. Of course, I still think that PEP 580 is better, but I also want this functionality even if PEP 590 is accepted.
Jeroen, is there something in PEPs 579/580 that PEP 590 blocks, or should address?
Well, PEP 580 is an extensible protocol while PEP 590 is not. But, PyTypeObject is extensible, so even with PEP 590 one can always extend that (for example, PEP 590 uses a type flag Py_TPFLAGS_METHOD_DESCRIPTOR where PEP 580 instead uses the structs for the C call protocol). But I guess that extending PyTypeObject will be harder to justify (say, in a future PEP) than extending the C call protocol. Also, it's explicitly allowed for users of the PEP 580 protocol to extend the PyCCallDef structure with custom fields. But I don't have a concrete idea of whether that will be useful. Kind regards, Jeroen.

On 4/11/19 1:05 AM, Jeroen Demeyer wrote:
One general note: I am not (yet) choosing between PEP 580 and PEP 590. I am not looking for arguments for/against whole PEPs, but individual ideas which, I believe, can still be mixed & matched. I see the situation this way:
- I get about one day per week when I can properly concentrate on CPython. It's frustrating to be the bottleneck.
- Jeroen has time, but it would be frustrating to work on something that will later be discarded, and it's frustrating to not be able to move the project forward.
- Mark has good ideas, but seems to lack the time to polish them, or even test out if they are good. It is probably frustrating to see unpolished ideas rejected.
I'm looking for ways to reduce the frustration, given where we are. Jeroen, thank you for the comments. Apologies for not having the time to reply to all of them properly right now. Mark, if you could find the time to answer (even just a few of the points), it would be great. I ask you to share/clarify your thoughts, not defend your PEP.
Sadly, I need more time on this than I have today; I'll get back to it next week.
Again, I'll get back to this next week.
True. I guess what I want from the answer is to know how much thought went into const handling: is what's in the PEP an initial draft, or does it solve some hidden issue?
Here again, I mostly want to know if the details are there for deeper reasons, or just points to polish.
Thank you. Sorry for the way this is dragged out. Would it help to set some timeline/deadlines here?
Thanks. I also like PEP 580's extensibility.
I don't have good general experience with premature extensibility, so I'd not count this as a plus.

Petr, I realize that you are in a difficult position. You'll end up disappointing either me or Mark... I don't know if the steering council or somebody else has a good idea to deal with this situation.
Jeroen has time
Speaking of time, maybe I should clarify that I have time until the end of August: I am working under the OpenDreamKit grant, which allows me to work basically full-time on open source software development, but it ends then.
Here again, I mostly want to know if the details are there for deeper reasons, or just points to polish.
I would say: mostly shallow details. The subclassing thing would be good to resolve, but I don't see any difference between PEP 580 and PEP 590 there. In PEP 580, I wrote a strategy for dealing with subclassing. I believe that it works and that exactly the same idea would work for PEP 590 too. Of course, I may be overlooking something...
I don't have good general experience with premature extensibility, so I'd not count this as a plus.
Fair enough. I also see it more as a "nice to have", not as a big plus.

On Thu, Apr 11, 2019 at 5:06 AM Jeroen Demeyer <J.Demeyer@ugent.be> wrote:
Our answer was "ask Petr to be BDFL Delegate". ;) In all seriousness, none of us on the council are as well equipped as Petr to handle this tough decision; otherwise it would take even longer for us to learn enough to make an informed decision and we would be even worse off. -Brett

On 4/10/19 7:05 PM, Jeroen Demeyer wrote:
All about performance as well as simplicity, correctness, testability, teachability... And PEP 580 touches some introspection :)
I think we're talking past each other. I see now it as: PEP 580 takes existing complexity and makes it available to all users, in a simpler way. It makes existing code faster. PEP 590 defines a new simple/fast protocol for its users, and instead of making existing complexity faster and easier to use, it's left to be deprecated/phased out (or kept in existing classes for backwards compatibility). It makes it possible for future code to be faster/simpler. I think things should be simple by default, but if people want some extra performance, they can opt in to some extra complexity.
Yet, for backwards compatibility reasons, we can't merge the classes. Also, I think CPython and Cython are exactly the users that can trade some extra complexity for better performance.
PEP 580 optimizes all the code paths, where PEP 590 optimizes the fast path, and makes sure most/all use cases can use (or switch to) the fast path. Both fast paths are fast: bridging a->e using zero-copy arg passing with some C calls and flag checks. The PEP 580 approach is faster; PEP 590's is simpler.
That's a good point.
Unless I'm missing something, that would be effectively the same as extending their own instance struct. To bring any benefits, the extended PyCCallDef would need to be standardized in a PEP.

On 2019-04-25 00:24, Petr Viktorin wrote:
Can you elaborate on what you mean by this deprecating/phasing out? What's your view on dealing with method classes (not necessarily right now, but in the future)? Do you think that having separate method classes like method-wrapper (for example [].__add__) is good or bad? Since the way PEP 580 and PEP 590 deal with bound method classes is very different, I would like to know the roadmap for this. Jeroen.

On 4/25/19 5:12 AM, Jeroen Demeyer wrote:
Kept for backwards compatibility, but not actively recommended or optimized. Perhaps made slower if that would help performance elsewhere.
I fully agree with PEP 579's point on complexity:
There are a huge number of classes involved to implement all variations of methods. This is not a problem by itself, but a compounding issue.
The main problem is that, currently, you sometimes need to care about this (due to CPython special-casing its own classes, without falling back to some public API). Ideally, what matters is the protocols a class implements rather than the class itself. If that is solved, having so many different classes becomes curious but unimportant -- merging them shouldn't be a priority. I'd concentrate on two efforts instead:

- Calling should have a fast public API. (That's this PEP.)
- Introspection should have a well-defined, consistently used public API (but not necessarily a fast one).

For introspection, I think the way forward is implementing the necessary API (e.g. dunder attributes) and changing things like inspect, traceback generation, etc. to use them. CPython's callable classes should stay internal implementation details. (Specifically, I'm against making them subclassable: allowing subclasses basically makes everything about the superclass an API.)
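As a quick illustration (my sketch, not part of any PEP): introspection via dunder attributes already works uniformly, even though the concrete callable classes differ wildly under the hood.

```python
# Introspection via dunder attributes works across callable kinds,
# while the concrete class remains an internal detail.
print(callable(len))                 # True
print(len.__name__)                  # 'len'
print([].__add__.__qualname__)       # 'list.__add__'
# CPython's internal class for slot wrappers is "method-wrapper":
print(type([].__add__).__name__)     # 'method-wrapper'
```

The point: code that asks for `__name__`/`__qualname__`/`callable()` keeps working no matter how the classes are merged or split internally.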
Since the way how PEP 580 and PEP 590 deal with bound method classes is very different, I would like to know the roadmap for this.
My thoughts are not the roadmap, of course :) Speaking of roadmaps, I often use PEP 579 to check what I'm forgetting. Here are my thoughts on it:

## Naming

(The word "built-in" is overused in Python.) This is a social/docs problem, and out of scope of the technical efforts. PEPs should always define the terms they use (even in the case where there is an official definition, but it doesn't match popular usage).

## Not extendable

As I mentioned above, I'm against opening the callables up for subclassing. We should define and use protocols instead.

## cfunctions do not become methods

If we were designing Python from scratch, this should have been done differently. Now this is a problem for Cython to solve; CPython should provide the tools to do so.

## Semantics of inspect.isfunction

I don't like inspect.isfunction, because "Is it a function?" is almost never what you actually want to ask. I'd like to deprecate it in favor of explicit questions like "Does it have source code?", "Is it callable?", or even "Is it exactly types.FunctionType?". But I'm against changing its behavior -- people are expecting the current answer.

## C functions should have access to the function object

That's where my stake in all this is; I want to move on with PEP 573 after 580/590 is sorted out.

## METH_FASTCALL is private and undocumented

This is the intersection of PEP 580 and 590.

## Allowing native C arguments

This would be a very experimental feature. Argument Clinic itself is not intended for public use, and locking its "impl" functions in as part of the public API is off the table at this point. Cython's cpdef allows this nicely, and CPython's API is full of C functions. That should be good enough for now.

## Complexity

We should simplify, but I think the number of callable classes is not the best metric to focus on.

## PyMethodDef is too limited

This is a valid point. But the PyMethodDef array is little more than a shortcut to creating methods directly in a loop.
The immediate workaround could be to create a new constructor for methods. Then we can look into expressing the data declaratively again.

## Slot wrappers have no custom documentation

I think this can now be done with a new custom slot wrapper class. Perhaps that can be added to CPython when it matures.

## Static methods and class methods should be callable

This is a valid, though minor, point. I don't even think it would be a PEP-level change.
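On the inspect.isfunction point, a quick sketch of why "Is it a function?" is rarely the question people mean to ask:

```python
import inspect
import types

def pure_python():
    pass

# inspect.isfunction() effectively asks "is it exactly types.FunctionType?"
print(inspect.isfunction(pure_python))  # True
print(inspect.isfunction(len))          # False: len is a C builtin...
print(callable(len))                    # True: ...yet it is certainly callable
# The explicit spelling of the narrow question:
print(isinstance(pure_python, types.FunctionType))  # True
```

Callers that really mean "is it callable?" or "does it have source code?" get different answers from these checks for C-implemented callables.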

On 2019-04-25 23:11, Petr Viktorin wrote:
My thoughts are not the roadmap, of course :)
I asked about methods because we should be aware of the consequences when choosing between PEP 580 and PEP 590 (or some compromise). There are basically three different ways of dealing with bound methods:

(A) Put methods inside the protocol. This is PEP 580 and my 580/590 compromise proposal. The disadvantage here is complexity in the protocol.

(B) Don't put methods inside the protocol and use a single generic method class, types.MethodType. This is the status quo for Python functions. It has the disadvantage of being slightly slower: there is an additional level of indirection when calling a bound method object.

(C) Don't put methods inside the protocol but use multiple method classes, one for every function class. This is the status quo for functions implemented in C. This has the disadvantage of code duplication.

I think that the choice between PEP 580 and 590 should be made together with a choice of one of the above options. For example, I really don't like the code duplication of (C), so I would prefer PEP 590 with (B) over PEP 590 with (C).
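The status quo behind options (B) and (C) is visible from Python today (a quick illustration, not tied to either PEP):

```python
import types

class Example:
    def method(self):
        return 42

obj = Example()

# (B): Python-level functions all share the single generic bound method
# class, types.MethodType.
print(type(obj.method) is types.MethodType)  # True

# (C): C-level functions instead come with per-kind method classes.
print(type([].append).__name__)              # 'builtin_function_or_method'
print(type(list.append).__name__)            # 'method_descriptor'
print(type([].__add__).__name__)             # 'method-wrapper'
```

Each of the three C-level classes in the last lines duplicates the "bind self, then call" logic that types.MethodType provides once for Python functions.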

Hi Petr, On 24/04/2019 11:24 pm, Petr Viktorin wrote:
Why do you say that PEP 580's approach is faster? There is no evidence for this. The only evidence so far is a couple of contrived benchmarks. Jeroen's showed a ~1% speedup for PEP 580 and mine showed a ~30% speedup for PEP 590. This clearly shows that I am better at coming up with contrived benchmarks :) PEP 590 was chosen as the fastest protocol I could come up with that was fully general, and wasn't so complex as to be unusable.
Saying that PEP 590 is not extensible is true, but misleading. PEP 590 is fully universal, it supports callables that can do anything with anything. There is no need for it to be extended because it already supports any possible behaviour. Cheers, Mark.

Hi, Petr On 10/04/2019 5:25 pm, Petr Viktorin wrote:
Not quite. Py_TPFLAGS_METHOD_DESCRIPTOR is for LOAD_METHOD/CALL_METHOD; it allows any callable descriptor to benefit from the LOAD_METHOD/CALL_METHOD optimisation. PY_VECTORCALL_ARGUMENTS_OFFSET exists so that callables that make onward calls with one additional argument can do so efficiently. The obvious example is bound methods, but classes are at least as important: cls(*args) -> cls.__new__(cls, *args) -> cls.__init__(self, *args)
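The "prepend one argument and call onward" pattern Mark describes can be seen from pure Python (a sketch of the semantics only; the C-level point is that PY_VECTORCALL_ARGUMENTS_OFFSET lets this prepending happen without copying the argument array):

```python
class Point:
    def __new__(cls, *args):
        # Calling the class effectively does cls.__new__(cls, *args):
        # one extra argument (cls) is prepended to the caller's arguments.
        print("__new__ called with", args)
        return super().__new__(cls)

    def __init__(self, x, y):
        # ...followed by cls.__init__(self, *args): self is prepended.
        self.x, self.y = x, y

p = Point(1, 2)
print(p.x, p.y)  # 1 2
```

At each step, the callee receives the original arguments with exactly one object stuck in front, which is the case the offsetting trick optimizes.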
The updated minimal implementation now uses `const` arguments. Code that uses args[-1] must explicitly cast away the const. https://github.com/markshannon/cpython/blob/vectorcall-minimal/Objects/class...
Firstly, each callable has different behaviour, so it makes sense to be able to do the dispatch from caller to callee in one step. Having a per-object function pointer allows that. Secondly, callables are either large or transient. If large, then the extra few bytes make little difference. If transient, then it matters even less. The total increase in memory is likely to be only a few tens of kilobytes, even for a large program.
Yes, removing it makes sense. I can then rename the clumsily named "PyObject_VectorCallWithCallable" as "PyObject_VectorCall".
METH_VECTORCALL is just making METH_FASTCALL | METH_KEYWORDS documented and public. Would you prefer that it have a different name, to prevent confusion with PY_VECTORCALL_ARGUMENTS_OFFSET? I don't like calling things "fast" or "new", as the names can easily become misleading. New College, Oxford is over 600 years old. Not so "new" any more :)
The minimal implementation is also a complete implementation. Third-party code can start using the vectorcall protocol immediately and be called efficiently from the interpreter. I think it is very close to being mergeable. Gaining the promised performance improvements is obviously a lot more work, but that can be done incrementally over the next few months. Cheers, Mark.

Hi Jeroen, On 15/04/2019 9:38 am, Jeroen Demeyer wrote:
Here's some (untested) code for an implementation of vectorcall for object subtypes implemented in Python. It uses PY_VECTORCALL_ARGUMENTS_OFFSET to save memory allocation when calling the __init__ method. https://github.com/python/cpython/commit/9ff46e3ba0747f386f9519933910d63d5ca... Cheers, Mark.

So, I spent another day pondering the PEPs. I love PEP 590's simplicity and PEP 580's extensibility. As I hinted before, I hope they can be combined, and I believe we can achieve that by having PEP 590's (o+offset) point not just to a function pointer, but to a {function pointer; flags} struct, with flags defined for two optimizations:

- "Method-like", i.e. compatible with LOAD_METHOD/CALL_METHOD.
- "Argument offsetting request", allowing PEP 590's PY_VECTORCALL_ARGUMENTS_OFFSET optimization.

This would mean one basic call signature (today's METH_FASTCALL | METH_KEYWORDS), with individual optimizations available if both the caller and callee support them.

In case you want to know my thoughts or details, let me indulge in some detailed comparisons and commentary that led to this. I also give a more detailed proposal below. Keep in mind I wrote this before I distilled it to the paragraph above, and though the distillation is written as a diff to PEP 590, I still think of this as merging both PEPs.

PEP 580 tries hard to work with existing call conventions (like METH_O, METH_VARARGS), making them fast. PEP 590 just defines a new convention: basically, any callable that wants performance improvements must switch to METH_VECTORCALL (fastcall). I believe PEP 590's approach is OK. To stay as performant as possible, C extension authors will need to adapt their code regularly. If they don't, no harm -- the code will still work as before, and will still be about as fast as it was before. In exchange for this, Python (and Cython, etc.) can focus on optimizing one calling convention, rather than a variety, each with its own advantages and drawbacks.

Extending PEP 580 to support a new calling convention involves defining a new CCALL_* constant and adding to the existing dispatch code. Extending PEP 590 to support a new calling convention will most likely require a new type flag, and either changing the vectorcall semantics or adding a new pointer.
To be a bit more concrete, I think of possible extensions to PEP 590 as things like:

- Accepting a kwarg dict directly, without copying the items to a tuple/array (as in PEP 580's CCALL_VARARGS|CCALL_KEYWORDS)
- Prepending more than one positional argument, or appending positional arguments
- When an optimization like LOAD_METHOD/CALL_METHOD turns out to no longer be relevant, removing it to simplify/speed up code.

I expect we'll later find out that something along these lines might improve performance, and PEP 590 would make it hard to experiment.

I mentally split PEP 590 into two pieces: formalizing fastcall, plus one major "extension" -- making bound methods fast. When seen this way, this "extension" is quite heavy: it adds an additional type flag, Py_TPFLAGS_METHOD_DESCRIPTOR, and uses a bit in the "Py_ssize_t nargs" argument as an additional flag. Both type flags and nargs bits are very limited resources. If I were sure vectorcall is the final best implementation we'll have, I'd go and approve it -- but I think we still need room for experimentation, in the form of more such extensions.

PEP 580, with its collection of per-instance data and flags, is definitely more extensible. What I don't like about it is that it has the extensions built in, mandatory for all callers/callees.

PEP 580 adds a common data struct to callable instances. Currently it holds all the data bound methods want to use (cc_flags, cc_func, cc_parent, cr_self). Various flags are consulted in order to deliver the needed info to the underlying function. PEP 590 lets the callable object store the data it needs independently. It provides a clever mechanism for pre-allocating space for bound methods' prepended "self" argument, so the data can be provided cheaply, though it's still done by the callable itself. Callables that would need to e.g. prepend more than one argument won't be able to use this mechanism, but y'all convinced me that is not worth optimizing for.
PEP 580's goal seems to be that making a callable behave like a Python function/method is just a matter of the right set of flags. Jeroen called this "complexity in the protocol". PEP 590, on the other hand, leaves much to individual callable types. This is "complexity in the users of the protocol".

I now don't see a problem with PEP 590's approach. Not all users will need the complexity. We need to give CPython and Cython the tools to make implementing "def"-like functions possible (and fast), but if other extensions need to match the behavior of Python functions, they should just use Cython. Emulating Python functions is a special-enough use case that it doesn't justify complicating the protocol, and the same goes for implementing Python's built-in functions (with all their historical baggage).

My fuller proposal for a compromise between PEP 580 and 590 would go something like below.

The type flag (Py_TPFLAGS_HAVE_VECTORCALL/Py_TPFLAGS_HAVE_CCALL) and offset (tp_vectorcall_offset/tp_ccalloffset; in tp_print's place) stay. The offset identifies a per-instance structure with two fields:

- Function pointer (with the vectorcall signature)
- Flags

Storing any other per-instance data (like PEP 580's cr_self/cc_parent) is the responsibility of each callable type.

Two flags are defined initially:

1. "Method-like" (like Py_TPFLAGS_METHOD_DESCRIPTOR in PEP 590, or non-NULL cr_self in PEP 580). Having the flag here instead of in a type flag will prevent tp_call-only callables from taking advantage of the LOAD_METHOD/CALL_METHOD optimisation, but I think that's OK.
2. Request to reserve space for one argument before the args array, as in PEP 590's argument offsetting. If the flag is missing, nargs may not include PY_VECTORCALL_ARGUMENTS_OFFSET. A mechanism incompatible with offsetting may use the bit for another purpose.
Both flags may simply be ignored by the caller (or not be set by the callee in the first place), reverting to a more straightforward (but less performant) code path. This should also be the case for any flags added in the future. Note how, without these flags, the protocol (and its documentation) will be extremely simple.

This mechanism would work with my examples of possible future extensions:

- "kwarg dict": A flag would enable the `kwnames` argument to be a dict instead of a tuple.
- Prepending/appending several positional arguments: The callable's request for how much space to allocate would be stored right after the {func; flags} struct. As in argument offsetting, a bit in nargs would indicate that the request was honored. (If this was made incompatible with one-arg offsetting, it could reuse the bit.)
- Removing an optimization: CPython would simply stop using the optimization (but not remove the flag). Extensions could continue to use the optimization between themselves.

As in PEP 590, any class that uses this mechanism shall not be usable as a base class. This will simplify implementation and tests, but hopefully the limitation will be removed in the future (maybe even in the initial implementation).

The METH_VECTORCALL (aka CCALL_FASTCALL|CCALL_KEYWORDS) calling convention is added to the public API. The other calling conventions (PEP 580's CCALL_O, CCALL_NOARGS, CCALL_VARARGS, CCALL_KEYWORDS, CCALL_FASTCALL, CCALL_DEFARG) as well as argument type checking (CCALL_OBJCLASS) and self slicing (CCALL_SELFARG) are left up to the callable.

No equivalent of PEP 580's restrictions on the __name__ attribute. In my opinion, the PyEval_GetFuncName function should just be deprecated in favor of getting the __name__ attribute and checking that it's a string. It would be possible to add a public helper that returns a proper reference, but that doesn't seem worth it. Either way, I consider this out of scope of this PEP.
No equivalent of PEP 580's PyCCall_GenericGetParent and PyCCall_GenericGetQualname either -- again, if needed, they should be retrieved as normal attributes. As I see it, the operation doesn't need to be particularly fast.

No equivalent of PEP 580's PyCCall_Call, and no support for a dict in PyCCall_FastCall's kwds argument. To be fast, extensions should avoid passing kwargs in a dict. Let's see how far that takes us. (FWIW, this also avoids subtle issues with dict mutability.)

Profiling stays as in PEP 580: only exact function types generate the events.

As in PEP 580, PyCFunction_GetFlags and PyCFunction_GET_FLAGS are deprecated.

As in PEP 580, nothing is added to the stable ABI.

Does that sound reasonable?

On 2019-04-25 00:24, Petr Viktorin wrote:
What's the rationale for putting the flags in the instance? Do you expect flags to be different between one instance and another instance of the same class?
Both type flags and nargs bits are very limited resources.
Type flags are only a limited resource if you think that all flags ever added to a type must be put into tp_flags. There is nothing wrong with adding new fields tp_extraflags or tp_vectorcall_flags to a type.
What I don't like about it is that it has the extensions built-in; mandatory for all callers/callees.
I don't agree with the above sentence about PEP 580:

- callers should use APIs like PyCCall_FastCall() and shouldn't need to worry about the implementation details at all.
- callees can opt out of all the extensions by not setting any special flags and setting cr_self to a non-NULL value. When using the flags CCALL_FASTCALL | CCALL_KEYWORDS, implementing the callee is exactly the same as in PEP 590.
As in PEP 590, any class that uses this mechanism shall not be usable as a base class.
Can we please lift this restriction? There is really no reason for it. I'm not aware of any similar restriction anywhere in CPython. Note that allowing subclassing is not the same as inheriting the protocol. As a compromise, we could simply never inherit the protocol. Jeroen.

On 4/25/19 10:42 AM, Jeroen Demeyer wrote:
I'm not tied to that idea. If there's a more reasonable place to put the flags, let's go for it, but it's not a big enough issue so it shouldn't complicate the protocol much. Quoting Mark from the other subthread:
Callables are either large or transient. If large, then the extra few bytes makes little difference. If transient then, it matters even less.
Indeed. Extra flags are just what I think PEP 590 is missing.
Imagine an extension author sitting down to read the docs and implement a callable:

- PEP 580 introduces 6 CCALL_* combinations: you need to select the best one for your use case. Also, add two structs to the instance and link them via pointers, and make sure you support descriptor behavior and the __name__ attribute. (Plus there are features for special purposes -- CCALL_DEFARG, CCALL_OBJCLASS, self-slicing -- but you can skip those initially.)
- My proposal: to the instance, add a function pointer with a known signature and flags which you set to zero. Add an offset to the type, and set a type flag. (There are additional possible optimizations, but you can skip them initially.)

PEP 580 makes a lot of sense if you read it all, but I fear there'll be very few people who read and understand it. And it's not important just for extension authors (admittedly, implementing a callable directly using the C API is often a bad idea). The more people understand the mechanism, the more people can help with further improvements.

I don't see the benefit of supporting the METH_VARARGS, METH_NOARGS, and METH_O calling conventions (beyond backwards compatibility and compatibility with Python's *args syntax). For keywords, I see a benefit in supporting *only one* of kwarg dict or kwarg tuple: if the caller and callee don't agree on which one to use, you need an expensive conversion. If we say tuple is the way, some of them will need to adapt, but within the set of those that do, any caller/callee combination will be fast. (And if tuple-only turns out to be the wrong choice, adding dict support in the future shouldn't be hard.)

That leaves fastcall (with tuple only) as the focus of this PEP, and the other calling conventions essentially as implementation details of builtin functions/methods.
Sure, let's use PEP 580 treatment of inheritance. Even if we don't, I don't think dropping this restriction would be a PEP-level change. It can be dropped as soon as an implementation and tests are ready, and inheritance issues ironed out. But it doesn't need to be in the initial implementation.
As a compromise, we could simply never inherit the protocol.
That also sounds reasonable for the initial implementation.

Hello, after reading the various comments and thinking about it more, let me propose a real compromise between PEP 580 and PEP 590.

My proposal is: take the general framework of PEP 580 but support only a single calling convention like PEP 590. The single calling convention supported would be what is currently specified by the flag combination CCALL_DEFARG|CCALL_FASTCALL|CCALL_KEYWORDS. This way, the flags CCALL_VARARGS, CCALL_FASTCALL, CCALL_O, CCALL_NOARGS, CCALL_KEYWORDS, CCALL_DEFARG can be dropped.

This calling convention is very similar to the calling convention of PEP 590, except that:

- the callable is replaced by a pointer to a PyCCallDef (the structure from PEP 580, but possibly without cc_parent)
- there is a self argument as in PEP 580. This implies support for the CCALL_SELFARG flag from PEP 580 and no support for the PY_VECTORCALL_ARGUMENTS_OFFSET trick of PEP 590.

Background: I added support for all those calling conventions in PEP 580 because I didn't want to make any compromise regarding performance. When writing PEP 580, I assumed that any kind of performance regression would be a reason to reject it. However, it now seems that you're willing to accept PEP 590 instead, which does introduce performance regressions in certain code paths. That suggests that we could keep the good parts of PEP 580 but reduce its complexity by having a single calling convention like PEP 590.

If you compare this compromise to PEP 590, the main difference is dealing with bound methods. Personally, I really like the idea of having a *single* bound method class which would be used by all kinds of function classes without any loss of performance (not only in CPython itself, but also by Cython and other C extensions). To support that, we need something like the PyCCallRoot structure from PEP 580, together with the special handling for self.
About cc_parent and CCALL_OBJCLASS: I prefer to keep those because they allow merging the classes for bare functions (not inside a class) and unbound methods (functions inside a class). Concretely, that could reduce code duplication between builtin_function_or_method and method_descriptor. But I'm also fine with removing cc_parent and CCALL_OBJCLASS. In any case, we can decide that later. What do you think? Jeroen.

Hello! Sorry for the delay; PyCon is keeping me busy. On the other hand, I did get to talk to a lot of smart people here!

I'm leaning toward accepting PEP 590 (with some changes still). Let's start focusing on it. As for the changes, I have these 4 points:

1. I feel that the API needs some contact with real users before it's set in stone. That was the motivation behind my proposal for PEP 590 with additional flags. At PyCon, Nick Coghlan suggested another option: make the API "provisional" by making it formally private. Py_TPFLAGS_HAVE_VECTORCALL would be underscore-prefixed, and the docs would say that it can change. In Python 3.9, the semantics will be finalized and the underscore removed. This would allow high-maintenance projects (like Cython) to start using it and give their feedback, and we'd have a chance to respond to that feedback.

2. tp_vectorcall_offset should be what replaces tp_print in the struct. The current implementation has tp_vectorcall there. This way, Cython can create vectorcall callables for older Pythons. (See PEP 580: https://www.python.org/dev/peps/pep-0580/#replacing-tp-print)

3. Subclassing should not be forbidden. Jeroen, do you want to write a section on how subclassing should work?

4. Given Jeroen's research and the ideas that went into the PEP (and hopefully, we'll incorporate some PEP 580 text as well), it seems fair to list him as co-author of the accepted PEP, instead of just listing PEP 580 in the acknowledgement section.

On some other points:

- Single bound method class for all kinds of function classes: This would be a cleaner design, yes, but I don't see a pressing need. As PEP 579 says, "this is a compounding issue", not a goal. As I recall, that is the only major reason for CCALL_DEFARG. PEP 590 says that x64 Windows passes 4 arguments in registers. Admittedly, I haven't checked this, nor the performance implications (so this would be a good point to argue!), but it seems like a good reason to keep the argument count down. So, no CCALL_DEFARG.
- In reply to this Mark's note:
PEP 590 is fully universal, it supports callables that can do anything with anything. There is no need for it to be extended because it already supports any possible behaviour.
I don't buy this point. The current tp_call also supports any possible behavior. Here we want to support any behavior *efficiently*. As a specific example: for calling PEP 590 callable with a kwarg dict, there'll need to be an extra allocation. That's inefficient relative to PEP 580 (or PEP 590 plus allowing a dict in "kwnames"). But I'm willing to believe the inefficiency is acceptable.

On 2019-05-06 00:04, Petr Viktorin wrote:
Just a minor correction here: I guess that you mean CCALL_SELFARG. The flag CCALL_DEFARG is for passing the PyCCallDef* in PEP 580, which is mostly equivalent to passing the callable object in PEP 590.

The signature of PEP 580 is

    func(const PyCCallDef *def, PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

and with PEP 590 it is

    func(PyObject *callable, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

with the additional special role for the PY_VECTORCALL_ARGUMENTS_OFFSET bit (which is meant to solve the problem of "self" in a different way).
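To make the part both signatures share concrete, here is a hypothetical pure-Python model of the fastcall/vectorcall argument layout (the function name is mine, not from either PEP): positional values and keyword values sit in one flat sequence, while the keyword *names* travel separately in a kwnames tuple.

```python
def call_vectorcall_style(func, args, kwnames=None):
    """Model of the vectorcall layout: `args` holds positional values
    followed by keyword values; `kwnames` names the trailing ones."""
    if kwnames:
        npos = len(args) - len(kwnames)
        kwargs = dict(zip(kwnames, args[npos:]))
        return func(*args[:npos], **kwargs)
    return func(*args)

# Equivalent to sorted([3, 1, 2], reverse=True):
print(call_vectorcall_style(sorted, ([3, 1, 2], True), ("reverse",)))  # [3, 2, 1]
print(call_vectorcall_style(max, (1, 5, 3)))                           # 5
```

This also shows where the tuple-vs-dict cost discussed earlier comes from: a caller holding a kwarg dict must build exactly this kind of flat-sequence-plus-names form (or vice versa) when the callee expects the other representation.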

On 5/6/19 3:43 AM, Jeroen Demeyer wrote:
I worded that badly, sorry. From PEP 590's `callable`, the called function can get any of these if it needs to (and if they're stored somewhere). But you can't write generic code that would get them from any callable. If we're not going for the "single bound method class" idea, that is OK; `def` and `self` can be implementation details of the callables that need them.

Hello Petr, Thanks for your time. I suggest that you (or somebody else) officially reject PEP 580. I'll start working on reformulating PEP 590, adding some elements from PEP 580. At the same time, I'll work on the implementation of PEP 590. I want to implement Mark's idea of having a separate wrapper for each old-style calling convention. In the meantime, we can continue the discussion about the details, for example whether to store the flags inside the instance (I don't have an answer for that right now; I'll need to think about it). Petr, did you discuss this with the Steering Council? It would be good to have some kind of pre-approval that PEP 590 and its implementation will be accepted. I want to work on PEP 590, but I'm not the right person to "defend" it (I know that it's worse in some ways than PEP 580). Jeroen.

On 5/6/19 4:24 AM, Jeroen Demeyer wrote:
I'll do that shortly. I hope that you are not taking this personally. PEP 580 is a good design. PEP 590 even says that it's built on your ideas.
I'm abandoning per-instance flag proposal. It's an unnecessary complication; per-type flags are fine.
As BDFL-delegate, I'm "pre-approving" PEP 590. I mentioned some details of PEP 590 that still need attention. If there are any more, now's the time to bring them up. And yes, I know that in some ways it's worse than PEP 580. That's what makes it a hard decision.

PEP 590 is on its way to being accepted, with some details still to be discussed. I've rejected PEP 580 so we can focus our efforts in one place.

Here are things we discussed on GitHub but now seem to agree on:

* The vectorcall function's kwnames argument can be NULL.
* Let's use `vectorcallfunc`, not `vectorcall`, and stop the bikeshedding.
* `tp_vectorcall_offset` can be `Py_ssize_t`. (The discussions around signedness and C standards and consistency are interesting, but ultimately irrelevant here.)
* `PyCall_MakeTpCall` can be removed.
* `PyVectorcall_Function` (for getting the `vectorcallfunc` of an object) can be an internal helper. External code should go through `PyCall_Vectorcall` (whatever we name it).
* `PY_VECTORCALL_ARGUMENTS_OFFSET` is OK; bikeshedding over variants like `PY_VECTORCALL_PREPEND` won't bring much benefit.

Anyone against, make your point now :)

The following have discussion PRs open:

* `PyCall_MakeVectorCall` name: https://github.com/python/peps/pull/1037
* Passing a dict to `PyObject_Vectorcall`: https://github.com/python/peps/pull/1038
* Type of the kwnames argument (PyObject/PyTupleObject): https://github.com/python/peps/pull/1039

The remaining points are:

### Making things private

For Python 3.8, the public API should be private, so the API can get some contact with the real world. I'd especially like to be able to learn from Cython's experience using it. That would mean:

* _PyObject_Vectorcall
* _PyCall_MakeVectorCall
* _PyVectorcall_NARGS
* _METH_VECTORCALL
* _Py_TPFLAGS_HAVE_VECTORCALL
* _Py_TPFLAGS_METHOD_DESCRIPTOR

### Can the kwnames tuple be empty?

Disallowing empty tuples means it's easier for the *callee* to detect the case of no keyword arguments. Instead of:

    if (kwnames != NULL && PyTuple_GET_SIZE(kwnames))

you have:

    if (kwnames != NULL)

On the other hand, the *caller* would now be responsible for handling the no-kwarg case specially. Jeroen points out:
But, if you apply the robustness principle to vectorcallfunc, it should accept empty tuples.

### `METH_VECTORCALL` function type

Jeroen suggested changing this from:

    PyObject *(*call) (PyObject *self, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

to `vectorcallfunc`, i.e.:

    PyObject *(*call) (PyObject *callable, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)

Mark argues that this is a major change, and that it prevents the interpreter from sanity-checking the return value of PyMethodDef-defined functions. (Since the functions are defined by extension code, they need to be sanity-checked, and this will be done by PyCFunction's vectorcall adapter. Tools like Cython can bypass the check if needed.) The underlying C function should not need to know how to extract "self" from the function object, or how to handle the argument offsetting. Those should be implementation details.

I see the value in having METH_VECTORCALL equivalent to the existing METH_FASTCALL|METH_KEYWORDS. (Even though PEP 573 will need to add to the calling convention.)

On Thu, May 9, 2019 at 11:31 AM Petr Viktorin <encukou@gmail.com> wrote:
Any reason the above are all "Vectorcall" and not "VectorCall"? You seem to potentially have that capitalization for "PyCall_MakeVectorCall" as mentioned below which seems to be asking for typos if there's going to be two ways to do it. :) -Brett

On 2019-05-09 23:09, Brett Cannon wrote:
"PyCall_MakeVectorCall" is a typo for "PyVectorcall_Call" (https://github.com/python/peps/pull/1037) Everything else uses "Vectorcall" or "VECTORCALL". In text, we use "vectorcall" without a space.

On 2019-05-09 20:30, Petr Viktorin wrote:
Do we really have to underscore the names? Would there be a way to mark this API as provisional and subject to change without changing the names? If it turns out that PEP 590 was perfect after all, then we're just breaking stuff in Python 3.9 (when removing the underscores) for no reason. Alternatively, could we keep the underscored names as official API in Python 3.9?

On 2019-05-09 20:30, Petr Viktorin wrote:
Maybe you misunderstood my proposal. I want to allow both for extra flexibility: - METH_FASTCALL (possibly combined with METH_KEYWORDS) continues to work as before. If you don't want to care about the implementation details of vectorcall, this is the right thing to use. - METH_VECTORCALL (using exactly the vectorcallfunc signature) is a new calling convention for applications that want the lowest possible overhead at the cost of being slightly harder to use. Personally, I consider the discussion about who is supposed to check that a function returns NULL if and only if an error occurred a tiny detail which shouldn't dictate the design. There are two solutions for this: either we move that check one level up and do it for all vectorcall functions, or we keep the existing checks in place but don't do that check for METH_VECTORCALL (this is already more specialized anyway, so dropping that check doesn't hurt much). We could also decide to enable this check only for debug builds, especially if debug builds are going to be easier to use thanks to Victor Stinner's work.
I see the value in having METH_VECTORCALL equivalent to the existing METH_FASTCALL|METH_KEYWORDS.
But why invent a new name for that? METH_FASTCALL|METH_KEYWORDS already works. The alias METH_VECTORCALL could only make things more confusing (having two ways to specify exactly the same thing). Or am I missing something? Jeroen.

On 5/9/19 5:33 PM, Jeroen Demeyer wrote:
Then we can, in the spirit of minimalism, not add METH_VECTORCALL at all.
METH_FASTCALL is currently not documented, and it should be renamed before it's documented. Names with "fast" or "new" generally don't age well.

Petr Viktorin schrieb am 10.05.19 um 00:07:
I personally don't see an advantage in having both, apart from helping code that wants to be fast also on Py3.7, for example. It unnecessarily complicates the CPython implementation and C-API. I'd be ok with removing FASTCALL in favour of VECTORCALL. That's more code to generate for Cython in order to adapt to Py<3.6, Py3.6, Py3.7 and then Py>=3.[89], but well, seeing the heap of code that we *already* generate, it's not going to hurt our users much. It would, however, be (selfishly) helpful if FASTCALL could still go through a deprecation period, because we'd like to keep the current Cython 0.29.x release series compatible with Python 3.8, and I'd like to avoid adding support for VECTORCALL and compiling out FASTCALL in a point release. Removing it in Py3.9 seems ok to me. Stefan

On 2019-05-10 00:07, Petr Viktorin wrote:
Just to make sure that we're understanding correctly, is your proposal to do the following: - remove the name METH_FASTCALL - remove the calling convention METH_FASTCALL without METH_KEYWORDS - rename METH_FASTCALL|METH_KEYWORDS -> METH_VECTORCALL

On 2019-05-09 20:30, Petr Viktorin wrote:
But, if you apply the robustness principle to vectorcallfunc, it should accept empty tuples.
Sure, if the callee wants to accept empty tuples anyway, it can do that. That's the robustness principle. But us *forcing* the callee to accept empty tuples is certainly not. Basically my point is: with a little bit of effort in CPython we can make things simpler for all users of vectorcall. Why not do that? Seriously, what's the argument for *not* applying this change? Jeroen.

Hi Jeroen, On 25/04/2019 3:42 pm, Jeroen Demeyer wrote:
AFAICT, any limitations on subclassing exist solely to prevent tp_call and the PEP 580/590 function pointer being in conflict. This limitation is inherent and the same for both PEPs. Do you agree? Let us consider a class C that sets the Py_TPFLAGS_HAVE_CCALL/Py_TPFLAGS_HAVE_VECTORCALL flag. It will set the function pointer in a new instance, C(), when the object is created. If we create a new class D: class D(C): def __call__(self, ...): ... and then create an instance `d = D()`, then calling d will have two contradictory behaviours: the one installed by C in the function pointer and the one specified by D.__call__. We can ensure correct behaviour by setting the function pointer to NULL or a forwarding function (depending on the implementation) if __call__ has been overridden. This would be enforced at class creation/readying time. Cheers, Mark.
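The hazard Mark describes can be sketched in pure Python (hypothetical names; the real mechanism is a C struct field, not an attribute): a per-instance "fast call" pointer is installed at creation time, so a caller that trusts it would bypass a subclass's overridden __call__.

```python
class C:
    def __init__(self):
        # Installed at instance creation, standing in for the
        # vectorcall/ccall function pointer set by class C.
        self._fastcall = lambda: "C behaviour"

    def __call__(self):
        # Normal dispatch goes through the stored pointer.
        return self._fastcall()

class D(C):
    def __call__(self):
        # Overrides the behaviour, but does NOT update _fastcall.
        return "D behaviour"

d = D()
# A caller using the stale per-instance pointer gets C's behaviour:
assert d._fastcall() == "C behaviour"
# Python's type-based dispatch respects the override:
assert d() == "D behaviour"
```

The fix proposed above corresponds to D's class-creation machinery nulling out (or forwarding) `_fastcall` whenever `__call__` is overridden, so the two paths can never disagree.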

On 2019-04-27 14:07, Mark Shannon wrote:
It's true that the function pointer in D will be wrong but it's also irrelevant since the function pointer won't be used: class D won't have the flag Py_TPFLAGS_HAVE_CCALL/Py_TPFLAGS_HAVE_VECTORCALL set.

Hi Petr, On 24/04/2019 11:24 pm, Petr Viktorin wrote:
A big problem with adding another field to the structure is that it prevents classes from implementing vectorcall. A 30% reduction in the time to create ranges, small lists and sets and to call type(x) is easily worth a single tp_flag, IMO. As an aside, there are currently over 10 spare flags. As long as we don't consume more than one a year, we have over a decade to make tp_flags a uint64_t. It already consumes 64 bits on any 64 bit machine, due to the struct layout. As I've said before, PEP 590 is universal and capable of supporting an implementation of PEP 580 on top of it. Therefore, adding any flags or fields from PEP 580 to PEP 590 will not increase its capability. Since any extra fields will require at least as many memory accesses as before, it will not improve performance, and by restricting layout may decrease it.
That would prevent the code having access to the callable object. That access is a fundamental part of both PEP 580 and PEP 590 and the key motivating factor for both.
As I see it, authors of C extensions have five options with PEP 590. Option 4, do nothing, is the recommended option :) 1. Use the PyMethodDef protocol; it will work exactly the same as before. It's already fairly quick in most cases. 2. Use Cython and let Cython take care of handling the vectorcall interface. 3. Use Argument Clinic, and let Argument Clinic take care of handling the vectorcall interface. 4. Do nothing. This is the same as 1-3 above, depending on what you were already doing. 5. Implement the vectorcall call directly. This might be a bit quicker than the above, but probably not enough to be worth it, unless you are implementing numpy or something like that.
Not just bound methods, any callable that adds an extra argument before dispatching to another callable. This includes builtin-methods, classes and a few others. Setting the Py_TPFLAGS_METHOD_DESCRIPTOR flag states the behaviour of the object when used as a descriptor. It is up to the implementation to use that information how it likes. If LOAD_METHOD/CALL_METHOD gets replaced, then the new implementation can still use this information.
This seems a lot more complex than the caller setting a bit to tell the callee whether it has allocated extra space.

Discussion on PEP 590 (Vectorcall) has been split over several PRs, issues and e-mails, so let me post an update. I am planning to approve PEP 590 with the following changes, if Mark doesn't object to them: * https://github.com/python/peps/pull/1064 (Mark the main API as private to allow changes in Python 3.9) * https://github.com/python/peps/pull/1066 (Use size_t for "number of arguments + flag") The resulting text, for reference: PEP: 590 Title: Vectorcall: a fast calling protocol for CPython Author: Mark Shannon <mark@hotpy.org>, Jeroen Demeyer <J.Demeyer@UGent.be> BDFL-Delegate: Petr Viktorin <encukou@gmail.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 29-Mar-2019 Python-Version: 3.8 Post-History: Abstract ======== This PEP introduces a new C API to optimize calls of objects. It introduces a new "vectorcall" protocol and calling convention. This is based on the "fastcall" convention, which is already used internally by CPython. The new features can be used by any user-defined extension class. Most of the new API is private in CPython 3.8. The plan is to finalize semantics and make it public in Python 3.9. **NOTE**: This PEP deals only with the Python/C API, it does not affect the Python language or standard library. Motivation ========== The choice of a calling convention impacts the performance and flexibility of code on either side of the call. Often there is tension between performance and flexibility. The current ``tp_call`` [2]_ calling convention is sufficiently flexible to cover all cases, but its performance is poor. The poor performance is largely a result of having to create intermediate tuples, and possibly intermediate dicts, during the call. This is mitigated in CPython by including special-case code to speed up calls to Python and builtin functions. Unfortunately, this means that other callables such as classes and third party extension objects are called using the slower, more general ``tp_call`` calling convention. 
This PEP proposes that the calling convention used internally for Python and builtin functions is generalized and published so that all calls can benefit from better performance. The new proposed calling convention is not fully general, but covers the large majority of calls. It is designed to remove the overhead of temporary object creation and multiple indirections. Another source of inefficiency in the ``tp_call`` convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class ``cls``, at least one intermediate object is created for each call in the sequence ``type.__call__``, ``cls.__new__``, ``cls.__init__``. This PEP proposes an interface for use by extension modules. Such interfaces cannot effectively be tested, or designed, without having the consumers in the loop. For that reason, we provide private (underscore-prefixed) names. The API may change (based on consumer feedback) in Python 3.9, where we expect it to be finalized, and the underscores removed. Specification ============= The function pointer type ------------------------- Calls are made through a function pointer taking the following parameters: * ``PyObject *callable``: The called object * ``PyObject *const *args``: A vector of arguments * ``size_t nargs``: The number of arguments plus the optional flag ``PY_VECTORCALL_ARGUMENTS_OFFSET`` (see below) * ``PyObject *kwnames``: Either ``NULL`` or a tuple with the names of the keyword arguments This is implemented by the function pointer type: ``typedef PyObject *(*vectorcallfunc)(PyObject *callable, PyObject *const *args, size_t nargs, PyObject *kwnames);`` Changes to the ``PyTypeObject`` struct -------------------------------------- The unused slot ``printfunc tp_print`` is replaced with ``tp_vectorcall_offset``. It has the type ``Py_ssize_t``. 
A new ``tp_flags`` flag is added, ``_Py_TPFLAGS_HAVE_VECTORCALL``, which must be set for any class that uses the vectorcall protocol. If ``_Py_TPFLAGS_HAVE_VECTORCALL`` is set, then ``tp_vectorcall_offset`` must be a positive integer. It is the offset into the object of the vectorcall function pointer of type ``vectorcallfunc``. This pointer may be ``NULL``, in which case the behavior is the same as if ``_Py_TPFLAGS_HAVE_VECTORCALL`` was not set. The ``tp_print`` slot is reused as the ``tp_vectorcall_offset`` slot to make it easier for external projects to backport the vectorcall protocol to earlier Python versions. In particular, the Cython project has shown interest in doing that (see https://mail.python.org/pipermail/python-dev/2018-June/153927.html). Descriptor behavior ------------------- One additional type flag is specified: ``Py_TPFLAGS_METHOD_DESCRIPTOR``. ``Py_TPFLAGS_METHOD_DESCRIPTOR`` should be set if the callable uses the descriptor protocol to create a bound method-like object. This is used by the interpreter to avoid creating temporary objects when calling methods (see ``_PyObject_GetMethod`` and the ``LOAD_METHOD``/``CALL_METHOD`` opcodes). Concretely, if ``Py_TPFLAGS_METHOD_DESCRIPTOR`` is set for ``type(func)``, then: - ``func.__get__(obj, cls)(*args, **kwds)`` (with ``obj`` not None) must be equivalent to ``func(obj, *args, **kwds)``. - ``func.__get__(None, cls)(*args, **kwds)`` must be equivalent to ``func(*args, **kwds)``. There are no restrictions on the object ``func.__get__(obj, cls)``. The latter is not required to implement the vectorcall protocol. The call -------- The call takes the form ``((vectorcallfunc)(((char *)o)+offset))(o, args, n, kwnames)`` where ``offset`` is ``Py_TYPE(o)->tp_vectorcall_offset``. The caller is responsible for creating the ``kwnames`` tuple and ensuring that there are no duplicates in it. ``n`` is the number of positional arguments plus possibly the ``PY_VECTORCALL_ARGUMENTS_OFFSET`` flag.
PY_VECTORCALL_ARGUMENTS_OFFSET ------------------------------ The flag ``PY_VECTORCALL_ARGUMENTS_OFFSET`` should be added to ``n`` if the callee is allowed to temporarily change ``args[-1]``. In other words, this can be used if ``args`` points to argument 1 in the allocated vector. The callee must restore the value of ``args[-1]`` before returning. Whenever they can do so cheaply (without allocation), callers are encouraged to use ``PY_VECTORCALL_ARGUMENTS_OFFSET``. Doing so will allow callables such as bound methods to make their onward calls cheaply. The bytecode interpreter already allocates space on the stack for the callable, so it can use this trick at no additional cost. See [3]_ for an example of how ``PY_VECTORCALL_ARGUMENTS_OFFSET`` is used by a callee to avoid allocation. For getting the actual number of arguments from the parameter ``n``, the macro ``PyVectorcall_NARGS(n)`` must be used. This allows for future changes or extensions. New C API and changes to CPython ================================ The following functions or macros are added to the C API: - ``PyObject *_PyObject_Vectorcall(PyObject *obj, PyObject *const *args, size_t nargs, PyObject *keywords)``: Calls ``obj`` with the given arguments. Note that ``nargs`` may include the flag ``PY_VECTORCALL_ARGUMENTS_OFFSET``. The actual number of positional arguments is given by ``PyVectorcall_NARGS(nargs)``. The argument ``keywords`` is a tuple of keyword names or ``NULL``. An empty tuple has the same effect as passing ``NULL``. This uses either the vectorcall protocol or ``tp_call`` internally; if neither is supported, an exception is raised. - ``PyObject *PyVectorcall_Call(PyObject *obj, PyObject *tuple, PyObject *dict)``: Call the object (which must support vectorcall) with the old ``*args`` and ``**kwargs`` calling convention. This is mostly meant to put in the ``tp_call`` slot. 
- ``Py_ssize_t PyVectorcall_NARGS(size_t nargs)``: Given a vectorcall ``nargs`` argument, return the actual number of arguments. Currently equivalent to ``nargs & ~PY_VECTORCALL_ARGUMENTS_OFFSET``. Subclassing ----------- Extension types inherit the type flag ``_Py_TPFLAGS_HAVE_VECTORCALL`` and the value ``tp_vectorcall_offset`` from the base class, provided that they implement ``tp_call`` the same way as the base class. Additionally, the flag ``Py_TPFLAGS_METHOD_DESCRIPTOR`` is inherited if ``tp_descr_get`` is implemented the same way as the base class. Heap types never inherit the vectorcall protocol because that would not be safe (heap types can be changed dynamically). This restriction may be lifted in the future, but that would require special-casing ``__call__`` in ``type.__setattribute__``. Finalizing the API ================== The underscore in the names ``_PyObject_Vectorcall`` and ``_Py_TPFLAGS_HAVE_VECTORCALL`` indicates that this API may change in minor Python versions. When finalized (which is planned for Python 3.9), they will be renamed to ``PyObject_Vectorcall`` and ``Py_TPFLAGS_HAVE_VECTORCALL``. The old underscore-prefixed names will remain available as aliases. The new API will be documented as normal, but will warn of the above. Semantics for the other names introduced in this PEP (``PyVectorcall_NARGS``, ``PyVectorcall_Call``, ``Py_TPFLAGS_METHOD_DESCRIPTOR``, ``PY_VECTORCALL_ARGUMENTS_OFFSET``) are final. Internal CPython changes ======================== Changes to existing classes --------------------------- The ``function``, ``builtin_function_or_method``, ``method_descriptor``, ``method``, ``wrapper_descriptor``, ``method-wrapper`` classes will use the vectorcall protocol (not all of these will be changed in the initial implementation). For ``builtin_function_or_method`` and ``method_descriptor`` (which use the ``PyMethodDef`` data structure), one could implement a specific vectorcall wrapper for every existing calling convention. 
Whether or not it is worth doing that remains to be seen. Using the vectorcall protocol for classes ----------------------------------------- For a class ``cls``, creating a new instance using ``cls(xxx)`` requires multiple calls. At least one intermediate object is created for each call in the sequence ``type.__call__``, ``cls.__new__``, ``cls.__init__``. So it makes a lot of sense to use vectorcall for calling classes. This really means implementing the vectorcall protocol for ``type``. Some of the most commonly used classes will use this protocol, probably ``range``, ``list``, ``str``, and ``type``. The ``PyMethodDef`` protocol and Argument Clinic ------------------------------------------------ Argument Clinic [4]_ automatically generates wrapper functions around lower-level callables, providing safe unboxing of primitive types and other safety checks. Argument Clinic could be extended to generate wrapper objects conforming to the new ``vectorcall`` protocol. This will allow execution to flow from the caller to the Argument Clinic generated wrapper and thence to the hand-written code with only a single indirection. Third-party extension classes using vectorcall ============================================== To enable call performance on a par with Python functions and built-in functions, third-party callables should include a ``vectorcallfunc`` function pointer, set ``tp_vectorcall_offset`` to the correct value and add the ``_Py_TPFLAGS_HAVE_VECTORCALL`` flag. Any class that does this must implement the ``tp_call`` function and make sure its behaviour is consistent with the ``vectorcallfunc`` function. Setting ``tp_call`` to ``PyVectorcall_Call`` is sufficient. Performance implications of these changes ========================================= This PEP should not have much impact on the performance of existing code (neither in the positive nor the negative sense). It is mainly meant to allow efficient new code to be written, not to make existing code faster. 
Nevertheless, this PEP optimizes for ``METH_FASTCALL`` functions. Performance of functions using ``METH_VARARGS`` will become slightly worse. Stable ABI ========== Nothing from this PEP is added to the stable ABI (PEP 384). Alternative Suggestions ======================= bpo-29259 --------- PEP 590 is close to what was proposed in bpo-29259 [#bpo29259]_. The main difference is that this PEP stores the function pointer in the instance rather than in the class. This makes more sense for implementing functions in C, where every instance corresponds to a different C function. It also allows optimizing ``type.__call__``, which is not possible with bpo-29259. PEP 576 and PEP 580 ------------------- Both PEP 576 and PEP 580 are designed to enable 3rd party objects to be both expressive and performant (on a par with CPython objects). The purpose of this PEP is to provide a uniform way to call objects in the CPython ecosystem that is both expressive and as performant as possible. This PEP is broader in scope than PEP 576 and uses variable rather than fixed offset function-pointers. The underlying calling convention is similar. Because PEP 576 only allows a fixed offset for the function pointer, it would not allow the improvements to any objects with constraints on their layout. PEP 580 proposes a major change to the ``PyMethodDef`` protocol used to define builtin functions. This PEP provides a more general and simpler mechanism in the form of a new calling convention. This PEP also extends the ``PyMethodDef`` protocol, but merely to formalise existing conventions. Other rejected approaches ------------------------- A longer, 6 argument, form combining both the vector and optional tuple and dictionary arguments was considered. However, it was found that the code to convert between it and the old ``tp_call`` form was overly cumbersome and inefficient. Also, since only 4 arguments are passed in registers on x64 Windows, the two extra arguments would have non-negligible costs.
Removing any special cases and making all calls use the ``tp_call`` form was also considered. However, unless a much more efficient way was found to create and destroy tuples, and to a lesser extent dictionaries, then it would be too slow. Acknowledgements ================ Victor Stinner for developing the original "fastcall" calling convention internally to CPython. This PEP codifies and extends his work. References ========== .. [#bpo29259] Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects, https://bugs.python.org/issue29259 .. [2] tp_call/PyObject_Call calling convention https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_call .. [3] Using PY_VECTORCALL_ARGUMENTS_OFFSET in callee https://github.com/markshannon/cpython/blob/vectorcall-minimal/Objects/class... .. [4] Argument Clinic https://docs.python.org/3/howto/clinic.html Reference implementation ======================== A minimal implementation can be found at https://github.com/markshannon/cpython/tree/vectorcall-minimal Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End:
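The ``nargs`` encoding from the PEP text above can be sketched as plain bit arithmetic. This is a minimal Python model, not the C API: the concrete flag value is an implementation detail, and the sketch assumes the reference implementation's choice of the most significant bit of a 64-bit ``size_t``.

```python
# Assumed flag value: the high bit of a 64-bit size_t (implementation detail).
PY_VECTORCALL_ARGUMENTS_OFFSET = 1 << 63

def PyVectorcall_NARGS(n):
    # Mask off the flag to recover the true positional-argument count,
    # mirroring "nargs & ~PY_VECTORCALL_ARGUMENTS_OFFSET" from the PEP.
    return n & ~PY_VECTORCALL_ARGUMENTS_OFFSET

# A caller that can afford scratch space at args[-1] sets the flag:
n = 3 | PY_VECTORCALL_ARGUMENTS_OFFSET
assert PyVectorcall_NARGS(n) == 3
assert n & PY_VECTORCALL_ARGUMENTS_OFFSET      # callee can test the flag

# A caller without scratch space passes the bare count:
assert PyVectorcall_NARGS(3) == 3
```

Because callees must always go through ``PyVectorcall_NARGS`` rather than reading ``n`` directly, the flag bit (or future extension bits) can change without breaking them.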

On 3/24/2019 8:21 AM, Nick Coghlan wrote:
Where do we discuss these? If a delegate has a provisional view, it might help focus discussion if that were known.
* PEP 499: Binding "-m" executed modules under their module name as well as `__main__`
My brief response: +1 unless there is a good reason not to. There have been multiple double-module problems reported on python-list and likely stackoverflow. And would there be any impact on circular imports? -- Terry Jan Reedy

On 24Mar2019 17:02, Terry Reedy <tjreedy@udel.edu> wrote:
There turn out to be some subtle side effects. The test suite turned up one (easily fixed) in pdb, but there are definitely some more things to investigate. Nick has pointed out pickle and the "python -i" option. I'm digging into these. (Naturally, I have _never_ before used the pdb or pickle modules, or the -i option :-)
Well, by binding the -m module to both __main__ and its name as denoted on the command line one circular import is directly short circuited. Aside from the -m module itself, I don't think there should be any other direct effect on circular imports. Did you have a specific scenario in mind? Cheers, Cameron Simpson <cs@cskk.id.au>

On 3/24/2019 7:00 PM, Cameron Simpson wrote:
I was thinking about IDLE and its tangled web of circular imports, but I am now convinced that this change will not affect it. Indeed, idlelib/pyshell.py already implements the idea of the proposal, ending with if __name__ == "__main__": sys.modules['pyshell'] = sys.modules['__main__'] main() (It turns out that this fails for other reasons, which I am looking into. The current recommendation is to start IDLE by running any of __main__.py (via python -m idlelib), idle.py, idlew.py, or idle.bat.) -- Terry Jan Reedy

On 3/24/2019 10:01 PM, Terry Reedy wrote:
On 3/24/2019 7:00 PM, Cameron Simpson wrote:
After more investigation, I realized that to stop having duplicate modules: 1. The alias should be 'idlelib.pyshell', not 'pyshell', at least when imports are all absolute. 2. It should be done at the top of the file, before the import of modules that import pyshell. If I run python f:/dev/3x/lib/idlelib/pyshell.py, the PEP patch would have to notice that pyshell is a module within idlelib and alias '__main__' to 'idlelib.pyshell', not 'pyshell'. Would the same be true if within-package imports were all relative?
(It turns out that this fails for other reasons, which I am looking into.
Since starting IDLE with pyshell once worked in the past, it appears to be because the startup command for run.py was outdated. Will fix. -- Terry Jan Reedy

On 24Mar2019 23:22, Terry Reedy <tjreedy@udel.edu> wrote:
The PEP499 patch effectively uses __main__.__spec__.name for the name of the alias. Does that simplify your issue? The current PR is here if you want to look at it: https://github.com/python/cpython/pull/12490
2. It should be done at the top of the file, before the import of modules that import pyshell.
Hmm, if PEP499 comes in you shouldn't need to do this at all. If PEP499 gets delayed or rejected I guess you're supporting this without it. Yes, you'll want to do it before any other imports happen (well, as you say, before any which import pyshell). What about (untested):

    if __name__ == '__main__':
        if __spec__.name not in sys.modules:
            sys.modules[__spec__.name] = sys.modules['__main__']

as a forward-compatible setup?
I think so because we're using .__spec__.name, which I thought was post-import name resolution. Testing in my PEP499 branch:

Test 1:

    [~/src/cpython-cs@github(git:PEP499-cs)]fleet*> ./python.exe -i Lib/idlelib/pyshell.py
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'run'
    >>> sys.modules['__main__']
    <module '__main__' (<_frozen_importlib_external.SourceFileLoader object at 0x1088e6040>)>
    >>> sys.modules['pyshell']
    <module '__main__' (<_frozen_importlib_external.SourceFileLoader object at 0x1088e6040>)>
    >>> sys.modules['idlelib.pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>

So pyshell and idlelib.pyshell are distinct here. __main__ and pyshell are the same module, courtesy of your sys.modules assignment at the bottom of pyshell.py. Test 3 below will be with that commented out.

Test 2:

    [~/src/cpython-cs@github(git:PEP499-cs)]fleet*> PYTHONPATH=$PWD/Lib ./python.exe -i -m idlelib.pyshell
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'run'
    >>> sys.modules['__main__']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> sys.modules['pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> sys.modules['idlelib.pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> id(sys.modules['__main__'])
    4551072712
    >>> id(sys.modules['pyshell'])
    4551072712
    >>> id(sys.modules['idlelib.pyshell'])
    4551072712

So this has __main__ and idlelib.pyshell the same module from the PEP499 patch, and pyshell also the same from your sys.modules assignment.
Test 3, with the pyshell.py sys.modules assignment commented out:

    [~/src/cpython-cs@github(git:PEP499-cs)]fleet*> PYTHONPATH=$PWD/Lib ./python.exe -i -m idlelib.pyshell
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'run'
    >>> sys.modules['__main__']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> sys.modules['pyshell']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'pyshell'
    >>> sys.modules['idlelib.pyshell']
    <module 'idlelib.pyshell' from '/Users/cameron/src/cpython-cs@github/Lib/idlelib/pyshell.py'>
    >>> id(sys.modules['__main__'])
    4552379336
    >>> id(sys.modules['idlelib.pyshell'])
    4552379336

Here we've got __main__ and idlelib.pyshell the same module and no 'pyshell' in sys.modules. I don't think I understand your "relative import" scenario. Cheers, Cameron Simpson <cs@cskk.id.au>
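The aliasing trick at the heart of these tests can be demonstrated in isolation (the dotted name below is hypothetical, chosen to avoid clashing with real modules): point the dotted module name at the already-running __main__ module so a later absolute import finds it instead of loading a second copy.

```python
import sys

main_mod = sys.modules["__main__"]      # the running script's module
alias = "idlelib_demo.pyshell"          # hypothetical dotted name

if alias not in sys.modules:
    sys.modules[alias] = main_mod

# An absolute import of the alias name would now resolve to the running
# module rather than loading the same file again from disk.
aliased = sys.modules[alias] is sys.modules["__main__"]
assert aliased

del sys.modules[alias]                  # clean up the demo alias
```

This is exactly what PEP 499 proposes to do automatically for `python -m` invocations, using `__spec__.name` as the alias key.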

On 3/25/2019 12:27 AM, Cameron Simpson wrote:
The new test passes on Win10.
When I start pyshell in my master repository directory on windows with python -m idlelib.pyshell, __spec__.name is 'idlelib.pyshell', which I currently hard-coded. When I start with what should be equivalent python f:/dev/3x/lib/idlelib/pyshell.py, __spec__ is None and __spec__.name an attribute error.
You must be doing something different when __spec__ is None ;-). I tested the patch and it does not raise AttributeError with the command above.
This is because of an obsolete 'command = ...' around line 420. The if line is always correct, and the if/then is not needed.
I verified that the module was being executed twice by putting print('running') at the top. __main__ and pyshell
are the same module, courtesy of your sys.modules assignment at the bottom of pyshell.py.
Obsolete and removed. Test 3 below will be with that commented out.
I don't think I understand your "relative import" scenario.
If files other than pyshell used relative 'import ./pyshell' instead of absolute 'import idlelib.pyshell', would the sys.modules key still be 'idlelib.pyshell' or 'pyshell'? Which is to ask: would the alias needed to avoid a second pyshell module still be 'idlelib.pyshell' or 'pyshell'?

On 25Mar2019 03:52, Terry Reedy <tjreedy@udel.edu> wrote:
Um, yes. I presume that since no "import" has been done, there's no import spec (.__spec__). Clearly the above needs to accommodate this, possibly with a fallback guess. Is sniffing the end components of __file__ at all sane? Ending in idlelib/pyshell.py or pyshell.py? Or is that just getting baroque? I don't think these are strictly the same from some kind of purist viewpoint: the path might be anything - _is_ it reasonable to suppose that it has a module name (i.e. importable/findable through the import path)?
Indeed. I may have fudged a bit when I said "The PEP499 patch effectively uses __main__.__spec__.name". It modifies runpy.py's _run_module_as_main function, and that is called for the "python -m module_name" invocation, so it can get the module spec because it has a module name. So the patch doesn't have to cope with __spec__ being None. As you say, __spec__ is None for "python path/to/file.py" so __spec__ isn't any use there. Apologies. [...]
Ok. As I understand it, Python 3 imports are absolute: without a leading dot a name is absolute, so "import pyshell" should install sys.modules['pyshell'] _provided_ that 'pyshell' can be found in the module search path. Conversely, an "import .pyshell" is an import relative to the current module's package name, equivalent to an import of the absolute path "package.name.pyshell", for whatever the package name is. So (a) you can only import '.pyshell' from within a package containing a 'pyshell.py' file and (b) you can't import '.pyshell' if you're not in a package.

I stuffed a "test2.py" into the local idlelib like this:

    import sys
    print("running", __file__, __name__)
    print(repr(sorted(sys.modules)))
    print(repr(sys.path))
    from pyshell import idle_showwarning
    print(repr(sorted(sys.modules)))

and fiddled with the "from pyshell import idle_showwarning" line. (I'm presuming this is what you have in mind, since "import ./pyshell" elicits a syntax error.)

Using "./python.exe -m idlelib.test2": Plain "pyshell" gets an ImportError - no such module. Using ".pyshell" imports the pyshell module as "idlelib.pyshell" in sys.modules. Which was encouraging until I went "./python.exe Lib/idlelib/test2.py". This puts Lib/idlelib (as an absolute path) at the start of sys.path. A plain "pyshell" import works and installs sys.modules['pyshell']. Conversely, trying the ".pyshell" import gets:

    ModuleNotFoundError: No module named '__main__.pyshell'; '__main__' is not a package

So we can get 'pyshell' or 'idlelib.pyshell' into sys.modules depending on how we invoke python. HOWEVER, if you're importing the 'pyshell' from idlelib _as found in the module search path_, whether absolutely as 'idlelib.pyshell' or relatively as '.pyshell' from within the idlelib package, you should always get 'idlelib.pyshell' in the sys.modules map.
And I don't think you should need to worry about a circular import importing some top level name 'pyshell' because that's not using the idlelib package, so I'd argue it isn't your problem. Thoughts? Cheers, Cameron Simpson <cs@cskk.id.au>
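The duplicate-module failure mode Cameron and Terry are probing can be reproduced without IDLE at all. A self-contained sketch (the package and module names are invented for the demo): the same source file gets loaded twice under two different sys.modules keys when it is reachable both through its package and directly through a sys.path entry inside the package.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package: tmp/idlepkg/shellmod.py
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "idlepkg")      # hypothetical package name
os.mkdir(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "shellmod.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, tmp)   # makes 'idlepkg.shellmod' importable
sys.path.insert(0, pkg)   # makes bare 'shellmod' importable too
importlib.invalidate_caches()

import idlepkg.shellmod
import shellmod

# Same file on disk, but two distinct module objects under two keys:
assert idlepkg.shellmod.VALUE == shellmod.VALUE == 42
assert sys.modules["idlepkg.shellmod"] is not sys.modules["shellmod"]
```

This is why the alias Terry needs must be the dotted name 'idlelib.pyshell': aliasing only the bare 'pyshell' key would still leave a second copy importable under the package-qualified name.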

On Mon, 25 Mar 2019 at 20:34, Cameron Simpson <cs@cskk.id.au> wrote:
Directly executing files from inside Python packages is explicitly unsupported, and nigh guaranteed to result in a broken import setup, as relative imports won't work, and absolute imports will most likely result in a second copy of the script module getting loaded. The problem is that __main__ always thinks it is a top-level module for directly executed scripts - it needs the package structure information from the "-m" switch to learn otherwise. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (9)
-
Brett Cannon
-
Cameron Simpson
-
Jeroen Demeyer
-
Mark Shannon
-
Nick Coghlan
-
Petr Viktorin
-
Petr Viktorin
-
Stefan Behnel
-
Terry Reedy