POPT (Python Object Provider Threads)

Hello,

Because Python is a very dynamic language, its memory management is used heavily: a lot of time goes into creating objects (reserving memory and filling the object structure with data) and destroying them again. Because of this, and because of the discussions about the GIL, I was wondering whether there isn't a way to get Python code executed truly in parallel without having to create several processes and without a huge overhead.

And here comes the idea for POPT. With this idea the Python interpreter runs several threads in the background (one thread for each object type), each of which manages a set of objects as an object cache. Every object in the cache is already preconfigured by its object provider thread, so only the parts of the object structure that are individual to the new instance have to be initialized. This saves a lot of processing time for the main thread, and the memory management has much less to do, because temporarily unused objects can be reused immediately.

Another advantage is that every Python program would use several CPU cores in parallel, even if it is a single-threaded application, without any need to change the Python code. If this idea is well implemented I expect a big performance improvement for all Python applications.

Best regards,

Martin
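To make the idea concrete, a minimal sketch of the provider-thread scheme at the Python level might look like the following (the ObjectProvider class and CACHE_SIZE constant are invented for illustration; a real implementation would live inside the interpreter itself, not in Python code): one background thread per type keeps a bounded queue topped up with pre-built instances, and consumers take from that queue instead of constructing objects from scratch.

    import queue
    import threading

    CACHE_SIZE = 128  # hypothetical size of each per-type object cache

    class ObjectProvider(threading.Thread):
        """Background thread that keeps a cache of pre-built objects of one type."""

        def __init__(self, factory):
            super().__init__(daemon=True)
            self.factory = factory                        # e.g. list, dict, bytearray
            self.cache = queue.Queue(maxsize=CACHE_SIZE)

        def run(self):
            while True:
                # Blocks once the cache is full; wakes up as objects are consumed.
                self.cache.put(self.factory())

        def get(self):
            # Fall back to direct construction if the cache happens to be empty.
            try:
                return self.cache.get_nowait()
            except queue.Empty:
                return self.factory()

    # Usage: one provider per type, started at interpreter startup.
    list_provider = ObjectProvider(list)
    list_provider.start()
    fresh_list = list_provider.get()   # instead of calling list() directly

Even at this level the questions raised in the replies below are visible: the queue itself needs locking, and an empty cache degrades to ordinary construction plus the cost of a failed lookup.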

On Tue, 19 Jun 2018 16:47:46 +0200 Martin Bammer <mrbm74@gmail.com> wrote:
A lot of time goes into creating objects (reserving memory and filling the object structure with data) and destroying them again.
Do you have numbers about that? One modus operandi would be to collect profiling data using Linux "perf" on a real Python workload you care about.
How does the main thread (or, rather, the multiple application threads) communicate with the background object threads? What is the communication and synchronization overhead in this scheme? Regards Antoine.

On Tue, Jun 19, 2018 at 04:47:46PM +0200, Martin Bammer wrote:
And here comes the idea for POPT. With this idea the Python interpreter runs several threads in the background (one thread for each object type)
The builtins module alone has 47 exception types. I hope you don't mean that each of those gets its own thread. Then there are the other builtins: str, bytes, int, float, set, frozenset, list, tuple, dict, type, bytearray, property, staticmethod, classmethod, range objects, zip objects, filter objects, slice objects, map objects, MethodType, FunctionType, ModuleType and probably more I forgot. That's another 20-odd threads there. Don't forget things like code objects, DictProxies, BuiltinMethodOrFunction, etc. When I run threaded code on my computer, I find that about 6 or 8 threads is optimal; any more than that and the code slows down. You want to use 25-30 threads just for memory management. Why do you hate me? *wink*
Every object in the cache is already preconfigured by its object provider thread, so only the parts of the object structure that are individual to the new instance have to be initialized.
For many objects, wouldn't that be close enough to "all of it"? (Apart from a couple of fields which never change.)
This saves a lot of processing time for the main thread
Do you know this for a fact or are you just hoping?
and the memory management has much less to do, because temporarily unused objects can be reused immediately.
Or, unused objects can sit around for a long, long time, locking up memory in a cache that would be better allocated towards *used* objects.
Another advantage is that every Python program would use several CPU cores in parallel, even if it is a single-threaded application, without any need to change the Python code.
How do you synchronise these threaded calls? Suppose I write this: x = (3.5, 4.5, "a", "b", {}, [], 1234567890, b"abcd", "c", 5.5) That has to synchronise 10 pieces of output from six threads before it can construct the tuple. How much overhead does that have?
If this idea is well implemented I expect a big performance improvement for all Python applications.
What are your reasons for this expectation? Do other interpreters do this? Have you tried an implementation and got promising results? -- Steve
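One way to put rough numbers on the synchronisation question is a small benchmark along these lines (purely illustrative; the helper names timed_direct and timed_via_queue are invented here): it compares constructing small objects directly with fetching pre-built ones through a queue.Queue handed over by a background thread, which is roughly the kind of cross-thread communication the proposal would require. On a typical CPython build the queue hand-off can be expected to be considerably slower than direct construction.

    import queue
    import threading
    import time

    N = 100_000

    def provider(q, factory, count):
        # Background thread that pre-builds objects and hands them over.
        for _ in range(count):
            q.put(factory())

    def timed_direct(factory, count):
        start = time.perf_counter()
        for _ in range(count):
            factory()
        return time.perf_counter() - start

    def timed_via_queue(factory, count):
        q = queue.Queue(maxsize=1024)
        t = threading.Thread(target=provider, args=(q, factory, count), daemon=True)
        t.start()
        start = time.perf_counter()
        for _ in range(count):
            q.get()
        elapsed = time.perf_counter() - start
        t.join()
        return elapsed

    if __name__ == "__main__":
        print("direct list():    %.3f s" % timed_direct(list, N))
        print("list() via queue: %.3f s" % timed_via_queue(list, N))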

On 20 June 2018 at 00:47, Martin Bammer <mrbm74@gmail.com> wrote:
If this idea is well implemented I expect a big performance improvement for all Python applications.
Given the free lists already maintained for several builtin types in the reference implementation, I suspect you may be disappointed on that front :) (While object creation overhead certainly isn't trivial, the interpreter's already pretty aggressive about repurposing previously allocated and initialised memory for new instances) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, I looked at the free list implementation this morning in the floatobject and listobject sources, which already handles the reuse of objects. I must admit I like this implementation; it's pretty smart. And yes, I'm a little bit disappointed, because it reduces the benefit of my idea a lot. Regards, Martin
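For readers who want to see the free list behaviour Martin mentions, it can often be observed from pure Python with a small experiment like the one below (run it as a script rather than at the interactive prompt, since the prompt allocates objects of its own between statements). Reuse of the freed slot is a CPython implementation detail, not a guarantee, so the check may print False on other interpreters or in future versions.

    # CPython keeps small free lists for several builtin types (e.g. floats and
    # lists), so a freshly freed object's memory is often handed straight to the
    # next allocation of the same type.  id() is the memory address in CPython.

    x = [1, 2, 3]
    old_id = id(x)
    del x             # the list structure goes back onto the free list

    y = [4, 5, 6]     # a new list; CPython typically reuses the freed slot
    print(id(y) == old_id)   # usually True on CPython, but not guaranteed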

Participants (5):

- Antoine Pitrou
- Guido van Rossum
- Martin Bammer
- Nick Coghlan
- Steven D'Aprano