Re: [Python-Dev] cpython: Implement PEP 412: Key-sharing dictionaries (closes #13903)

On Mon, 23 Apr 2012 17:24:57 +0200 benjamin.peterson <python-checkins@python.org> wrote:
I hope someone can measure the results of this change on real-world code. Benchmark results with http://hg.python.org/benchmarks/ are not overly promising. Regards, Antoine.

On Mon, 23 Apr 2012 22:22:18 +0200, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm pretty sure that anything heavily using sqlalchemy will benefit, so that would be a good place to look for a real-world benchmark. --David

Probably any benchmark involving a large number of object instances with non-trivial dictionaries. Benchmarks should measure memory usage too, of course. Sadly that is not possible in standard CPython. Our 2.7 branch has extensive patching to allow custom memory allocators to be used (it even eliminates the explicit "malloc" calls used here and there in the code) and exposes some functions, such as sys.getpymalloced(), useful for memory benchmarking. Perhaps I should write about this on my blog. Updating the memory allocation macro layer in CPython for embedding is something I'd be inclined to contribute, but it will involve a large amount of bikeshedding, I'm sure :) Btw, this is of great interest to me at the moment; our Shanghai engineers are screaming about the memory waste incurred by dictionaries. A 10-item dictionary consumes half a kilobyte on a 32-bit build, did you know that? K
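For a rough sense of what stock CPython can and cannot see here, a sketch like the following sums sys.getsizeof() over many instance dicts. It counts only the objects themselves, not allocator bookkeeping or block rounding, which is exactly the gap the custom-allocator patching addresses (the Point class is just an illustration):

    import sys

    class Point(object):
        def __init__(self, x, y):
            self.x = x
            self.y = y

    points = [Point(i, i) for i in range(10000)]
    # Counts only the dict objects themselves; allocator overhead
    # and rounding inside the block allocator remain invisible.
    total = sum(sys.getsizeof(p.__dict__) for p in points)
    print("10000 instance dicts: %d bytes" % total)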

On Tue, 24 Apr 2012 10:24:16 +0000 Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
The sparseness of hash tables is a well-known time/space tradeoff. See e.g. http://bugs.python.org/issue10408 Regards, Antoine.
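The sparseness is easy to observe from Python; a small sketch like this prints the size jumps as the table is resized (exact figures vary by version and platform):

    import sys

    d = {}
    last = sys.getsizeof(d)
    print("%3d items: %d bytes" % (0, last))
    for i in range(100):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            # each jump is the hash table growing to stay sparse
            print("%3d items: %d bytes" % (len(d), size))
            last = size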

On Tue, Apr 24, 2012 at 8:24 PM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
Trawl the tracker before you do - I'm pretty sure there's a patch (from the Nokia S60 port, IIRC) that adds a couple of macro definitions so that platform ports and embedding applications can intercept malloc() and free() calls. It would be way out of date by now, but I seem to recall thinking it looked reasonable at a quick glance. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Thanks. Meanwhile, I blogged about tuning the dict implementation. Preliminary testing seems to indicate that tuning it to conserve memory saves us 2 MB of wasted slots on the login screen. No small thing on a PS3 system. http://blog.ccpgames.com/kristjan/2012/04/25/optimizing-the-dict/ I wonder if we shouldn't make those factors into #defines, as I did in my 2.7 modifications, and even provide a "memory saving" predefine for embedders. (Believe it or not, sometimes Python performance is not an issue at all, but memory usage is.) K
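To make the "wasted slots" figure concrete, here is a back-of-the-envelope estimate assuming 2.7-era rules (8-slot minimum table, 12-byte entries on 32 bits, resize when more than two-thirds full). It gives a lower bound only, since the real resize overshoots by growing the table to four times the used count:

    def min_wasted_bytes(n_items, entry_size=12):
        # smallest power-of-two table keeping the load factor
        # under 2/3 -- an approximation of CPython 2.7's rule,
        # not a reimplementation of dictresize()
        slots = 8
        while n_items * 3 >= slots * 2:
            slots *= 2
        return (slots - n_items) * entry_size

    print("10-item dict wastes >= %d bytes in empty slots"
          % min_wasted_bytes(10))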

Benchmarks should measure memory usage too, of course. Sadly that is not possible in standard CPython.
It's actually very easy in standard CPython, using sys.getsizeof.
I did. In Python 3.3, this now goes down to 248 bytes (on a 32-bit build). Regards, Martin
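Those figures are easy to reproduce; something like the following prints the size on the current interpreter (exact numbers depend on version and pointer width, so expect roughly 500 bytes on a 32-bit 2.7 and 248 bytes on a 32-bit 3.3, per the figures above):

    import sys

    d = dict((i, None) for i in range(10))
    bits = sys.maxsize.bit_length() + 1
    print("%d-bit build, %d-item dict: %d bytes"
          % (bits, len(d), sys.getsizeof(d)))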

Yes, you can query each Python object about how big it thinks it is. What I'm speaking of is more like:

    start_allocs, start_mem = allocator.get_current()
    allocator.reset_limits()
    run_complicated_tests()
    end_allocs, end_mem = allocator.get_current()
    print "delta blocks: %d, delta mem: %d" % (end_allocs - start_allocs, end_mem - start_mem)
    print "peak blocks: %d, peak mem: %d" % allocator.peak()
I'm going to experiment with tunable parameters in 2.7 to trade performance for memory. In some applications, memory trumps performance. K

Kristján Valur Jónsson wrote:
Take a look at the benchmark suite at http://hg.python.org/benchmarks/ The test runner has an -m option that profiles memory usage; you could take a look at how that is implemented. Cheers, Mark.

Yes, out-of-process monitoring of memory as reported by the OS. We do gather those counters as well on clients and servers, but they don't give you the granularity you want when checking for memory leaks and memory usage by certain algorithms. In the same way that the unittests have reference-leak reports, they could just have memory usage reports, if the underlying allocator supported that. FYI, the current state of affairs of the CPython 2.7 branch we use is as follows:

1) We allow the API user to specify the base allocator Python uses, both for regular allocs and for allocating blocks for the obmalloc one, using:

    /* Support for custom allocators */
    typedef void *(*PyCCP_Malloc_t)(size_t size, void *arg,
                                    const char *file, int line, const char *msg);
    typedef void *(*PyCCP_Realloc_t)(void *ptr, size_t size, void *arg,
                                     const char *file, int line, const char *msg);
    typedef void (*PyCCP_Free_t)(void *ptr, void *arg,
                                 const char *file, int line, const char *msg);
    typedef size_t (*PyCCP_Msize_t)(void *ptr, void *arg);

    typedef struct PyCCP_CustomAllocator_t {
        PyCCP_Malloc_t  pMalloc;
        PyCCP_Realloc_t pRealloc;
        PyCCP_Free_t    pFree;
        PyCCP_Msize_t   pMsize; /* can be NULL, or return -1 if no size info is avail. */
        void *arg;              /* opaque argument for the functions */
    } PyCCP_CustomAllocator_t;

    /* To set an allocator, use 0 for the regular allocator, 1 for the block
     * allocator. Pass a null pointer to reset to the internal default. */
    PyAPI_FUNC(void) PyCCP_SetAllocator(int which, const PyCCP_CustomAllocator_t *);

    /* for BLUE to set the current context */
    /* internal data member */
    extern PyCCP_CustomAllocator_t _PyCCP_CustomAllocator[];

2) Using ifdefs, the macros delegate all final allocations through these allocators. This includes all the "naked" malloc calls scattered about; they are patched up using #defines.

3) Additionally, there is an internal layer of management before delegating to the external allocators. This internal manager provides statistics, exposed through the "sys" module.

The layering is something like this, all more or less definable by pre-processor macros (raw malloc() is turned into something else via pre-processor magic and a special "patch_malloc.h" file added to the modules which use raw malloc()):

    PyMem_Malloc()            PyObject_Malloc()
          |                         |
          v                         v
    Mem bookkeeping           obj bookkeeping
       |      |                  |      |
       |      v                  |      v
       |   malloc()              |  obmallocator
       |                         |      |
       v                         v      v
    PyMem_MALLOC_RAW()      PyObject_MALLOC_RAW()
          |                         |
          v                         v
    malloc() or vectored allocator specified through API function

Cheers, K
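As a conceptual model of the bookkeeping layer described above (not the actual C API), a counting wrapper in front of a raw allocator might look like this in Python; all names here are illustrative:

    class BookkeepingAllocator(object):
        """Toy model of the stats layer: wraps a raw malloc/free
        pair and tracks block count, current and peak bytes."""

        def __init__(self, raw_malloc, raw_free):
            self._malloc = raw_malloc
            self._free = raw_free
            self._sizes = {}          # id(block) -> requested size
            self.blocks = self.bytes = self.peak_bytes = 0

        def malloc(self, size):
            block = self._malloc(size)
            self._sizes[id(block)] = size
            self.blocks += 1
            self.bytes += size
            self.peak_bytes = max(self.peak_bytes, self.bytes)
            return block

        def free(self, block):
            self.bytes -= self._sizes.pop(id(block))
            self.blocks -= 1
            self._free(block)

        def get_current(self):
            return self.blocks, self.bytes

    # e.g. wrap bytearray as a stand-in "raw" allocator:
    alloc = BookkeepingAllocator(bytearray, lambda block: None)
    buf = alloc.malloc(1024)
    print(alloc.get_current())    # (1, 1024)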

This is easy in a debug build, using sys.getobjects(). In a release build, you can use pympler:

    import pympler.muppy

    start = pympler.muppy.get_size(pympler.muppy.get_objects())
    run_complicated_tests()
    end = pympler.muppy.get_size(pympler.muppy.get_objects())
    print "delta mem: %d" % (end - start)

Regards, Martin

Thanks for pointing out pympler to me. Sounds like fun, I'll try it out. I should point out that gc.get_objects() also works, if you don't care about stuff like ints and floats. Another reason why I like the runtime stats we have built in, however, is that querying them has no overhead: you can query the current resource usage as often as you like, and this is important in a running app. We log Python memory usage every second or so. Cheers, K
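For completeness, the gc.get_objects() approach mentioned above can be sketched like so; note the caveat that only container objects tracked by the cycle collector are counted:

    import gc
    import sys

    def tracked_bytes():
        # only containers tracked by the cycle collector;
        # ints, floats and most strings are invisible here
        return sum(sys.getsizeof(o) for o in gc.get_objects())

    before = tracked_bytes()
    junk = [{"a": i} for i in range(1000)]
    after = tracked_bytes()
    print("delta mem: %d bytes" % (after - before))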
