Improve CPython tracing performance
Hi all, Right now, when a debugger is active, the number of local variables can affect the tracing speed quite a lot. For instance, having tracing setup in a program such as the one below takes 4.64 seconds to run, yet, changing all the variables to have the same name -- i.e.: change all assignments to `a = 1` (such that there's only a single variable in the namespace), it takes 1.47 seconds (in my machine)... the higher the number of variables, the slower the tracing becomes. ``` import time t = time.time() def call(): a = 1 b = 1 c = 1 d = 1 e = 1 f = 1 def noop(frame, event, arg): return noop import sys sys.settrace(noop) for i in range(1_000_000): call() print('%.2fs' % (time.time() - t,)) ``` This happens because `PyFrame_FastToLocalsWithError` and `PyFrame_LocalsToFast` are called inside the `call_trampoline` ( https://github.com/python/cpython/blob/master/Python/sysmodule.c#L946). So, I'd like to simply remove those calls. Debuggers can call `PyFrame_LocalsToFast` when needed -- otherwise mutating non-current frames doesn't work anyways. As a note, pydevd already has such a call: https://github.com/fabioz/PyDev.Debugger/blob/0d4d210f01a1c0a8647178b2e665b5... and PyPy also has a counterpart. As for `PyFrame_FastToLocalsWithError`, I don't really see any reason to call it at all. i.e.: something as the code below prints the `a` variable from the `main()` frame regardless of that and I checked all pydevd tests and nothing seems to be affected (it seems that accessing f_locals already does this: https://github.com/python/cpython/blob/cb9879b948a19c9434316f8ab6aba9c4601a8..., so, I don't see much reason to call it at all). ``` def call(): import sys frame = sys._getframe() print(frame.f_back.f_locals) def main(): a = 1 call() if __name__ == '__main__': main() ``` Does anyone see any issue with this? If it's non controversial, is a PEP needed or just an issue to track it would be enough to remove those 2 lines? Thanks, Fabio
Le jeu. 29 oct. 2020 à 13:02, Fabio Zadrozny <fabiofz@gmail.com> a écrit :
Debuggers can call `PyFrame_LocalsToFast` when needed -- otherwise mutating non-current frames doesn't work anyways. As a note, pydevd already has such a call: https://github.com/fabioz/PyDev.Debugger/blob/0d4d210f01a1c0a8647178b2e665b5... and PyPy also has a counterpart.
Hum, if a trace or profile function is written in Python, reading frame.f_locals does call PyFrame_FastToLocalsWithError(). So a Python debugger/profiler would be ok with your code. For a debugger/profiler written in C, it would be a backward incompatible change. I agree that it would be reasonable to require it to call PyFrame_FastToLocalsWithError().
If it's non controversial, is a PEP needed or just an issue to track it would be enough to remove those 2 lines?
Incompatible changes should be well documented in What's New in Python 3.10. In this case, I don't think that a deprecation period is needed. Just open an issue. Please post the URL to your issue in reply to your email. It's even better if you can write a PR to implement your idea ;-) Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Thu, Oct 29, 2020 at 9:45 AM Victor Stinner <vstinner@python.org> wrote:
Le jeu. 29 oct. 2020 à 13:02, Fabio Zadrozny <fabiofz@gmail.com> a écrit :
Debuggers can call `PyFrame_LocalsToFast` when needed -- otherwise mutating non-current frames doesn't work anyways. As a note, pydevd already has such a call: https://github.com/fabioz/PyDev.Debugger/blob/0d4d210f01a1c0a8647178b2e665b5... and PyPy also has a counterpart.
Hum, if a trace or profile function is written in Python, reading frame.f_locals does call PyFrame_FastToLocalsWithError(). So a Python debugger/profiler would be ok with your code.
For a debugger/profiler written in C, it would be a backward incompatible change. I agree that it would be reasonable to require it to call PyFrame_FastToLocalsWithError().
If it's non controversial, is a PEP needed or just an issue to track it would be enough to remove those 2 lines?
Incompatible changes should be well documented in What's New in Python 3.10. In this case, I don't think that a deprecation period is needed.
Just open an issue. Please post the URL to your issue in reply to your email. It's even better if you can write a PR to implement your idea ;-)
Ok, I've created https://bugs.python.org/issue42197 to track it. -- Fabio
On Fri, 30 Oct 2020 at 00:19, Fabio Zadrozny <fabiofz@gmail.com> wrote:
On Thu, Oct 29, 2020 at 9:45 AM Victor Stinner <vstinner@python.org> wrote:
If it's non controversial, is a PEP needed or just an issue to track it would be enough to remove those 2 lines?
Incompatible changes should be well documented in What's New in Python 3.10. In this case, I don't think that a deprecation period is needed.
Just open an issue. Please post the URL to your issue in reply to your email. It's even better if you can write a PR to implement your idea ;-)
Removing those calls would require a PEP, as it would break all sorts of tools in cases that currently work correctly.
Ok, I've created https://bugs.python.org/issue42197 to track it.
Please also have a look at PEP 558 and its draft reference implementation at https://github.com/python/cpython/pull/3640 The way these trampoline calls currently work isn't just slow, it's actually broken in various ways, and changing them to use a write-through proxy instead of a dict-based snapshot means that the cost of producing those dict-based snapshots simply because tracing is turned on will go away. The PEP itself didn't seem to be particularly controversial (at least in its current form - earlier versions drew more objections), but there's a bunch of preparatory work that needs to be done before it could seriously be submitted for final review (specifically: the write-through proxy isn't actually implementing the full mutable mapping API. In order for it to do that without excessive code duplication, the helper functions already written for ordered dictionaries needed to moved out to a separate linkable module so that the new write-through proxy can reuse them without taking a separate copy of them) Cheers, Nick. P.S. If that sounds like a request for help, that's because it is ;) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Oct 30, 2020 at 7:02 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
On Fri, 30 Oct 2020 at 00:19, Fabio Zadrozny <fabiofz@gmail.com> wrote:
On Thu, Oct 29, 2020 at 9:45 AM Victor Stinner <vstinner@python.org> wrote:
If it's non controversial, is a PEP needed or just an issue to track
it would be enough to remove those 2 lines?
Incompatible changes should be well documented in What's New in Python 3.10. In this case, I don't think that a deprecation period is needed.
Just open an issue. Please post the URL to your issue in reply to your email. It's even better if you can write a PR to implement your idea ;-)
Removing those calls would require a PEP, as it would break all sorts of tools in cases that currently work correctly.
Ok, I've created https://bugs.python.org/issue42197 to track it.
Please also have a look at PEP 558 and its draft reference implementation at https://github.com/python/cpython/pull/3640
The way these trampoline calls currently work isn't just slow, it's actually broken in various ways, and changing them to use a write-through proxy instead of a dict-based snapshot means that the cost of producing those dict-based snapshots simply because tracing is turned on will go away.
The PEP itself didn't seem to be particularly controversial (at least in its current form - earlier versions drew more objections), but there's a bunch of preparatory work that needs to be done before it could seriously be submitted for final review (specifically: the write-through proxy isn't actually implementing the full mutable mapping API. In order for it to do that without excessive code duplication, the helper functions already written for ordered dictionaries needed to moved out to a separate linkable module so that the new write-through proxy can reuse them without taking a separate copy of them)
Hi Nick! As a note, the current implementation does allow debuggers to mutate frame locals -- as long as they understand that they need to call ` PyFrame_LocalsToFast ` when doing such a change -- potentially using ctypes (I'm just mentioning this because PEP 558 seems to imply this isn't possible). i.e.: Debuggers already *must* call ` PyFrame_LocalsToFast ` if locals from a frame which is not the current frame are being mutated, so, as far as I can see a debugger is already broken if it isn't doing that -- some years ago I even thought about exposing it in the frame API: https://bugs.python.org/issue1654367, but in the end accessing it through the C-API through ctypes does get the job done, debugger authors just need to be aware of it -- PyPy also has a counterpart mentioned in that issue. I agree that having f_locals be a proxy that does everything transparently would be better, but unfortunately I don't currently have the time available to help there... in your opinion, just removing those calls as I proposed (requiring that debuggers call `PyFrame_LocalsToFast`) would be acceptable? If you think it is, I'll proceed on creating the PEP, otherwise I'll probably just drop it until f_locals becomes a proxy (in which case I'd expect the `PyFrame_FastToLocalsWithError` and `PyFrame_LocalsToFast` to be removed in the same commit which converts f_locals to a proxy). Cheers, Fabio
On Fri, 30 Oct 2020 at 20:52, Fabio Zadrozny <fabiofz@gmail.com> wrote:
On Fri, Oct 30, 2020 at 7:02 AM Nick Coghlan <ncoghlan@gmail.com> wrote: As a note, the current implementation does allow debuggers to mutate frame locals -- as long as they understand that they need to call ` PyFrame_LocalsToFast ` when doing such a change -- potentially using ctypes (I'm just mentioning this because PEP 558 seems to imply this isn't possible).
If that's referring to the sentence that ends with ".. without even reliably enabling the desired functionality of allowing debuggers like pdb to mutate local variables.", the critical word there is "reliably". The existing API and implementation does allow you to change local variables from trace functions (especially if you use ctypes to access the C API rather than sticking solely to the Python level API), it just also messes up your application state sometimes.
i.e.: Debuggers already *must* call ` PyFrame_LocalsToFast ` if locals from a frame which is not the current frame are being mutated, so, as far as I can see a debugger is already broken if it isn't doing that -- some years ago I even thought about exposing it in the frame API: https://bugs.python.org/issue1654367, but in the end accessing it through the C-API through ctypes does get the job done, debugger authors just need to be aware of it -- PyPy also has a counterpart mentioned in that issue.
It isn't primarily debuggers that are a concern for backwards compatibility, it's trace functions manipulating the state of the frame being traced. In the current implementation, that feature relies on those FastToLocals and LocalsToFast conversion calls to work.
I agree that having f_locals be a proxy that does everything transparently would be better, but unfortunately I don't currently have the time available to help there... in your opinion, just removing those calls as I proposed (requiring that debuggers call `PyFrame_LocalsToFast`) would be acceptable? If you think it is, I'll proceed on creating the PEP, otherwise I'll probably just drop it until f_locals becomes a proxy (in which case I'd expect the `PyFrame_FastToLocalsWithError` and `PyFrame_LocalsToFast` to be removed in the same commit which converts f_locals to a proxy).
Checking the PR branch, the write-through proxy meant I was able to get rid of the call to PyFrame_LocalsToFast, but I couldn't get rid of the implicit snapshot refresh due to C API compatibility concerns: https://github.com/python/cpython/pull/3640/files#diff-a3a5c73931235f7f344c0... Assuming the PEP is eventually accepted, though, then it would be reasonable to subsequently deprecate that implicit refresh, and say that any trace function calling such APIs needs to make its own call to PyLocals_RefreshViews() first. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Le ven. 30 oct. 2020 à 11:02, Nick Coghlan <ncoghlan@gmail.com> a écrit :
Ok, I've created https://bugs.python.org/issue42197 to track it.
Please also have a look at PEP 558 and its draft reference implementation at https://github.com/python/cpython/pull/3640
I don't think that the PEP 558 and bpo-42197 are incompatible. Debuggers and profilers usually only care of specific frames or function calls (ex: 10% of function calls or even a single function call in a whole application). The problem is how to make them as efficient as possible for "no operation" calls, when they don't care about the current frame. Avoiding PyFrame_FastToLocalsWithError() to enter the debugger/profile and avoiding PyFrame_LocalsToFast() on exit sounds a simple and practical solution. By the way, I created https://bugs.python.org/issue40421 to prepare the C API to make the PyFrameObject structure opaque. Once it will be opaque, we will have more freedom on the API exposed to inspect frame locals. IMO it's a good idea to require function calls to inspect frame locals, and not let developers think that PyFrameObject.f_locals is always up-to-date and can always be modified. I dislike PyFrame_FastToLocalsWithError() and PyFrame_LocalsToFast() names. Maybe we should have an even more "explicit" API. Example (without error handling): --- // Implementation: Call PyFrame_FastToLocalsWithError() and returns PyFrameObject.f_locals locals = PyFrame_GetLocals(); // ... the debugger modifies some local variables in 'locals' dict ... // Implementation: Set PyFrameObject.f_locals and call PyFrame_LocalsToFast() PyFrame_SetLocals(locals); --- The good news is that PyFrame_GetLocals() and PyFrame_SetLocals() can easily be reimplemented in Python 3.9 for my compatibility header file: https://github.com/pythoncapi/pythoncapi_compat Such API avoids a complex proxy and simply reuses a regular dict object (exiting PyFrameObject.f_locals). Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Fri, 30 Oct 2020 at 23:34, Victor Stinner <vstinner@python.org> wrote:
Le ven. 30 oct. 2020 à 11:02, Nick Coghlan <ncoghlan@gmail.com> a écrit :
Ok, I've created https://bugs.python.org/issue42197 to track it.
Please also have a look at PEP 558 and its draft reference implementation at https://github.com/python/cpython/pull/3640
I don't think that the PEP 558 and bpo-42197 are incompatible.
Debuggers and profilers usually only care of specific frames or function calls (ex: 10% of function calls or even a single function call in a whole application). The problem is how to make them as efficient as possible for "no operation" calls, when they don't care about the current frame. Avoiding PyFrame_FastToLocalsWithError() to enter the debugger/profile and avoiding PyFrame_LocalsToFast() on exit sounds a simple and practical solution.
Aye, I agree. I just don't think we can remove those implicit calls without preparing a replacement API first.
By the way, I created https://bugs.python.org/issue40421 to prepare the C API to make the PyFrameObject structure opaque. Once it will be opaque, we will have more freedom on the API exposed to inspect frame locals.
IMO it's a good idea to require function calls to inspect frame locals, and not let developers think that PyFrameObject.f_locals is always up-to-date and can always be modified.
Aye, if direct access to the PyFrameObject.f_locals struct field goes away, then the cache refresh can be added to the accessor APIs, just as PEP 558 proposes doing for Python code.
The good news is that PyFrame_GetLocals() and PyFrame_SetLocals() can easily be reimplemented in Python 3.9 for my compatibility header file: https://github.com/pythoncapi/pythoncapi_compat
Such API avoids a complex proxy and simply reuses a regular dict object (exiting PyFrameObject.f_locals).
Unfortunately, cell variables mean that there's no way to make snapshot-with-writeback logic consistently correct in the presence of generators and coroutines (see https://www.python.org/dev/peps/pep-0558/#resolving-the-issues-with-tracing-... for the gist of that problem). The early iterations of PEP 558 worked that way (like you, I wanted to avoid having to actually implement the write-through proxy), but Nathaniel kept finding holes where bugs like bpo-30744 could still happen (hence the acknowledgement in the PEP), so I eventually conceded defeat and accepted that the proxy was the only approach that could ever be truly correct. The core of the write-through proxy implementation isn't actually too bad: https://github.com/python/cpython/blob/ed6e53be7794ec94924b69cd46d5b009633c6... The annoying parts are the outright copy-and-paste of odict code at the end of the file, and the few remaining mutable mapping methods that still raise NotImplementedError rather than doing what you'd expect. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Reply to an old thread. On Sat, Oct 31, 2020 at 8:02 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
Debuggers and profilers usually only care of specific frames or function calls (ex: 10% of function calls or even a single function call in a whole application). The problem is how to make them as efficient as possible for "no operation" calls, when they don't care about the current frame. Avoiding PyFrame_FastToLocalsWithError() to enter the debugger/profile and avoiding PyFrame_LocalsToFast() on exit sounds a simple and practical solution.
Aye, I agree. I just don't think we can remove those implicit calls without preparing a replacement API first.
Again, I don't think that it's incompatible. We can enforce calling PyFrame_FastToLocalsWithError() at enter and PyFrame_LocalsToFast() at exit for now, and enhance the API later.
Unfortunately, cell variables mean that there's no way to make snapshot-with-writeback logic consistently correct in the presence of generators and coroutines (...)
I do understand that there are corner cases where it doesn't work, ok. But https://bugs.python.org/issue42197 is a simple and ready to merge solution which should optimize profilers and debuggers for the most common case. To be clear: currently, PyFrame_FastToLocalsWithError() is called at enter and PyFrame_LocalsToFast() is called at exit. So asking debuggers/profilers to call them explicitly doesn't make the situation worse (nor better ;-)) for generators/coroutines, it would be exactly the same behavior. It's just an optimization. The PEP 558 is being discussed for 5 years and still a draft. I don't think that it should hold bpo-42197 optimization. Victor -- Night gathers, and now my watch begins. It shall not end until my death.
On Wed, 20 Jan 2021 at 19:22, Victor Stinner <vstinner@python.org> wrote:
Reply to an old thread.
On Sat, Oct 31, 2020 at 8:02 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
Debuggers and profilers usually only care of specific frames or function calls (ex: 10% of function calls or even a single function call in a whole application). The problem is how to make them as efficient as possible for "no operation" calls, when they don't care about the current frame. Avoiding PyFrame_FastToLocalsWithError() to enter the debugger/profile and avoiding PyFrame_LocalsToFast() on exit sounds a simple and practical solution.
Aye, I agree. I just don't think we can remove those implicit calls without preparing a replacement API first.
Again, I don't think that it's incompatible. We can enforce calling PyFrame_FastToLocalsWithError() at enter and PyFrame_LocalsToFast() at exit for now, and enhance the API later.
PEP 558 makes `PyFrame_LocalsToFast()` raise an exception, so the two approaches definitely aren't compatible :)
To be clear: currently, PyFrame_FastToLocalsWithError() is called at enter and PyFrame_LocalsToFast() is called at exit. So asking debuggers/profilers to call them explicitly doesn't make the situation worse (nor better ;-)) for generators/coroutines, it would be exactly the same behavior. It's just an optimization.
The PEP 558 is being discussed for 5 years and still a draft. I don't think that it should hold bpo-42197 optimization.
No, what should hold up the bpo-42197 PR is the fact that it's an API compatibility break that shouldn't be done without a PEP. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (3)
-
Fabio Zadrozny
-
Nick Coghlan
-
Victor Stinner