Preallocated tuples and dicts for function calls

Hi,

what about the idea that the interpreter preallocates and preinitializes the tuples and dicts for function calls, where possible, when loading a module? Before calling a function, the interpreter would then only need to update the items which are dynamic and call the function.

Some examples:

    msgpack.unpackb(b'\x93\x01\x02\x03', use_list=False, raw=False)

The above function call needs a tuple with 1 entry and a dict with 2 entries. All entries are constant, so in this case the interpreter can execute the function call immediately. Without the optimization the interpreter would need to:

- create a new tuple (allocate memory)
- write the constant into the first tuple index
- create a dict (allocate memory)
- add key+value
- add key+value
- call the function

Another example:

    foo(bar, 3, 5, arg1=bar1, arg2=True)

The above needs a tuple with 3 entries, 2 of them constant, and a dict with 2 entries, 1 of them constant.

With the optimization:

- write bar into the first tuple index
- replace the first key+value pair in the dict
- call the function

Without the optimization:

- create a new tuple (allocate memory)
- write bar into the first tuple index
- write the constant into the second tuple index
- write the constant into the third tuple index
- create a dict (allocate memory)
- add key+value
- add key+value
- call the function

If this idea is possible to implement, I assume function calls would see a great speed improvement.

Best regards,

Martin
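A quick way to see the per-call work described above is to disassemble such a call site (a minimal sketch using only the standard dis module; 'unpackb' is just an unresolved global name here, since the function is never actually executed):

    import dis

    def call_site():
        # stand-in for the msgpack example; disassembly never runs the body,
        # so the name 'unpackb' does not need to resolve
        unpackb(b'\x93\x01\x02\x03', use_list=False, raw=False)

    dis.dis(call_site)  # prints the LOAD/CALL sequence CPython compiles for this call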

Martin Bammer wrote:
what about the idea that the interpreter preallocates and preinitializes the tuples and dicts for function calls where possible when loading a module?
This would not be thread-safe. Locking would be needed around uses of the preallocated objects, and that might take away some or all of the gain. It would also introduce unexpected interactions between threads. -- Greg

On Fri, Mar 08, 2019 at 10:16:02PM +0100, Martin Bammer wrote:
That's an implementation detail. CPython may or may not use tuples and dicts to call functions, but I don't think that's specified by the language. So we're talking about a potential optimization of one interpreter, not a language change. If the idea survives cursory discussion here, the Python-Dev mailing list is probably a better place to discuss it further.
Before calling a function, the interpreter would then only need to update the items which are dynamic and call the function.
As Greg points out, that would be unsafe when using threads. Let's say you have two threads, A and B, and both call function spam(). A wants to call spam(1, 2) and B wants to call spam(3, 4). Because of the unpredictable order in which threaded code runs, we might have:

    A sets the argument tuple to (1, 2)
    B sets the argument tuple to (3, 4)
    B calls spam()
    A calls spam()   # Oops!

and mysterious, difficult-to-reproduce errors occur. It may be possible to solve this with locks, but that would probably slow code down horribly. [...]
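The failure mode is easy to replay by hand; here is a deliberately single-threaded sketch of that interleaving, with a shared list standing in for the hypothetical preallocated argument tuple:

    shared_args = [None, None]   # stands in for a shared, preallocated argument tuple

    def spam(a, b):
        print("spam called with", a, b)

    # Thread A prepares its arguments ...
    shared_args[0], shared_args[1] = 1, 2
    # ... but thread B overwrites them before A gets to make its call:
    shared_args[0], shared_args[1] = 3, 4
    spam(*shared_args)   # B's call: spam(3, 4) -- correct
    spam(*shared_args)   # A's call: also spam(3, 4), but A wanted spam(1, 2)!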
Sure, and that happens at runtime, just before the function is called. But the same series of allocations would have to occur under your idea too, it would just happen when the module loads. And then the pre-allocated tuples and dicts would hang around forever, wasting memory. Even if it turns out that the function never actually gets called:

    for x in sequence:
        if condition(x):  # always returns False!
            function(...)

the compiler will have pre-allocated the memory to call it. So I suspect this is going to be very memory hungry. Trading off memory for speed might be worthwhile, but it is a trade-off that will make certain things worse rather than better.
If this idea is possible to implement I assume the function calls would receive a great speed improvment.
Well, it might decrease the overhead of calling a function, but that's usually only a small proportion of the total time to make function calls. So it might not help as much as you expect, except in the case where you have lots and lots of function calls, each of which does only a tiny amount of work. But that has to be balanced against the slowdown that occurs when the module loads, when the same memory allocations (but not deallocations) would occur. Starting up Python is already pretty slow compared to other languages, and this would probably make it worse. Even if it became a net win for some applications, for others it would likely be a net loss. My guess is that it would probably hurt the cases which are already uncomfortably slow, while benefitting the cases that don't need much optimization. But that's just a guess, and not an especially educated guess at that. -- Steven
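A rough way to get a feel for the proportions is to time an empty function against one doing even a little real work (numbers are machine-dependent; only the ratio matters):

    import timeit

    # Near-pure call overhead: the body does nothing.
    print(timeit.timeit("f()", setup="def f(): pass", number=100_000))

    # The same call overhead, now buried under a modest amount of work.
    print(timeit.timeit("f()", setup="def f(): return sum(range(100))", number=100_000))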

Ok, right. There are some details which would modify the idea:

- Thread safety: Instead of locking, the thread id could be saved in the object and then checked when the object is used. If the thread id is wrong, a new object must be created. I think no additional locking is necessary because of the GIL.

- Startup time and memory waste: This idea can be improved by using lazy object initialization, so that the tuples and dicts are only created when the function is called the first time, and these objects are then kept in memory as long as the module is not unloaded. This would not hurt the startup time and would save memory.

One more detail which needs to be handled is recursive calling of the function. This can easily be handled with the reference counter. To keep the implementation simple, and to not get too memory hungry, this optimization should support just the first call level and not recursive calls. A rough model of the scheme is sketched below.

Regards,

Martin
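Here is a rough pure-Python model of that scheme (purely illustrative: the real thing would live inside the interpreter's call machinery, and all names below are made up):

    import threading

    # call-site key -> (owner thread id, preallocated argument buffer)
    _arg_buffers = {}

    def get_args_buffer(site_key, size):
        # Lazily create the buffer on the first call from a given site; if the
        # cached buffer belongs to a different thread, fall back to a fresh
        # object instead of locking.  (The reference-count check for recursive
        # calls is omitted for brevity.)
        me = threading.get_ident()
        entry = _arg_buffers.get(site_key)
        if entry is not None and entry[0] == me:
            return entry[1]
        buf = [None] * size
        if entry is None:
            _arg_buffers[site_key] = (me, buf)
        return buf

    buf = get_args_buffer("foo@example.py:1", 3)          # first call: creates the buffer
    assert get_args_buffer("foo@example.py:1", 3) is buf  # same thread: buffer is reused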

On Fri, Mar 8, 2019 at 6:23 PM Steven D'Aprano <steve@pearwood.info> wrote:
I'm pretty sure the idea was to have constant tuples (1, 2) and (3, 4) in the module instead of LOAD_CONST 1/2/3/4 instructions in the bytecode. There's no setting of a hidden global variable involved. The kwargs dicts are a harder problem. I suppose they would have to be copy-on-write, which would add too much complexity, or the language would have to be changed to allow/require kwargs to be a frozendict.
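The difficulty with the dicts is easy to demonstrate: today each call gets its own fresh kwargs dict, which the callee is free to mutate and even keep (a small sketch):

    def f(**kwargs):
        kwargs['extra'] = True   # legal today: the dict is private to this call
        return kwargs

    d1 = f(a=1)
    d2 = f(a=1)
    print(d1 is d2)   # False: a fresh dict per call
    print(d1)         # {'a': 1, 'extra': True}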
The bytecode for "function(...)" already hangs around forever even if it's never run. There is no need for tuple constants, because you can generate LOAD_CONST a; LOAD_CONST b; ...; BUILD_TUPLE n at each usage point instead; but CPython has tuple constants, so they must have some space and/or speed benefit that was considered significant enough to be worth implementing. It seems like args constants for function calls could be justified on similar grounds.
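The existing tuple-constant folding is visible with dis (exact opcode names vary across CPython versions):

    import dis

    # A tuple made only of constants is folded into a single LOAD_CONST ...
    dis.dis(compile("(1, 2)", "<demo>", "eval"))

    # ... while a tuple containing a variable needs BUILD_TUPLE at runtime.
    dis.dis(compile("(x, 2)", "<demo>", "eval"))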

On 08.03.19 23:16, Martin Bammer wrote:
Something like this has already been implemented. Tuples and dicts use free lists, so when you allocate a small tuple or an empty dict, you usually do not call expensive memory-allocating functions, but take a preallocated object from a free list. This still has some overhead, though.

This is why the property object used to have an attached preallocated tuple for passing the self argument to the getter function. But this was complex and error-prone code. There were at least three attempts to fix it, and new flaws were found months later after every attempt. Finally this microoptimization was removed, and named tuples will use a special type for the getters of their attributes in 3.8.

Internally, CPython uses the private "fast" calling convention for many builtin functions and methods. It avoids creating an intermediate tuple and dict at all. In the future this convention will be exposed publicly. Cython already uses it, so third-party extensions written in Cython enjoy this advantage.
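The free-list reuse is observable from Python, though it is purely a CPython implementation detail and nothing below is guaranteed behaviour:

    a = tuple(range(3))   # built at runtime, so not a code-object constant
    addr = id(a)
    del a                 # the tuple goes back on the small-tuple free list
    b = tuple(range(3))
    print(id(b) == addr)  # often True on CPython: b reused the freed object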

On 2019-03-08 22:16, Martin Bammer wrote:
The basic premise here is wrong: function calls using the METH_FASTCALL convention don't need to allocate any temporary tuple/dict for function calls. Of course, not all function calls use METH_FASTCALL, but most of them where performance matters do.

participants (6):
- Ben Rudiak-Gould
- Greg Ewing
- Jeroen Demeyer
- Martin Bammer
- Serhiy Storchaka
- Steven D'Aprano