[Evan Jones]
I know that this has been discussed a bit in the past, but I was hoping that some Python gurus could shed some light on this issue, and maybe let me know if there are any plans for solving this problem. I know a hack that might work, but there must be a better way to solve this problem.
I agree there are several issues here that are important for significant classes of apps, but have no plans to do anything about them (I simply don't have time for it). I'm not aware of anyone else intending to work on these areas either, so it's all yours <wink>.
The short version of the problem is that obmalloc.c never frees memory.
True. That's one major problem for some apps. Another major problem for some apps is due to unbounded internal free lists outside of obmalloc. Another is that the platform OS+libc may not shrink VM at times even when memory is returned to the system free(). ...
In fact, the other native object types (ints, lists) seem to realize that holding on to a huge amount of memory indefinitely is a bad strategy, because they explicitly limit the size of their free lists.
Most native object types don't have free lists (there are *many* native object types); they use pymalloc or the system malloc; type-specific free lists are generally found attached only to "high use" native types, where speed and/or memory-per-object was thought important enough to bother with a custom free list. Not all custom free lists are implemented in the same basic way. The most important oddballs are the free lists for ints and floats, which are unbounded and immortal. ...
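The "unbounded and immortal" free-list pattern described above can be sketched roughly like this. This is a simplified stand-in, not the actual intobject.c code; the names (`obj`, `obj_new`, `obj_del`, `free_list`) are invented for illustration. Freed objects are pushed onto a linked list and never returned to the system, no matter how long the list grows:

```c
/* Simplified sketch of an unbounded, immortal free list, in the
 * spirit of CPython's int free list.  Names and structure are
 * illustrative only, not the actual intobject.c code. */
#include <stdio.h>
#include <stdlib.h>

typedef struct obj {
    struct obj *next;   /* links free objects; unused while "live" */
    long value;
} obj;

static obj *free_list = NULL;   /* grows without bound, never freed */

static obj *obj_new(long v)
{
    obj *o;
    if (free_list != NULL) {            /* reuse a freed object */
        o = free_list;
        free_list = o->next;
    }
    else {                              /* otherwise fall back to malloc */
        o = malloc(sizeof(obj));
        if (o == NULL)
            return NULL;
    }
    o->value = v;
    return o;
}

static void obj_del(obj *o)
{
    /* "Freeing" just pushes onto the free list: the memory is never
     * handed back to the allocator, which is exactly the behavior
     * being complained about for ints and floats. */
    o->next = free_list;
    free_list = o;
}
```

A deleted object's block is recycled by the very next allocation of the same type, which is why this is such a speed win for high-churn types -- and why a one-time spike of a few million ints pins that memory for the life of the process.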
Proposal: - Python's memory allocator should occasionally free memory
That's a worthy goal.
if the memory usage has been relatively constant, and has been well below the amount of memory allocated.
That's a possible implementation strategy. I think you'll find it helpful to distinguish goals from implementations.
This will incur additional overhead to free the memory, and additional overhead to reallocate it if the memory is needed again quickly. However, it will make Python co-operate nicely with other processes,
This is so complicated in real life -- depends on the OS, depends on details of the system malloc's implementation, "what works" on one platform may not work on another, etc.
and a clever implementation should be able to reduce the overhead.
Problem: - I do not completely understand Python's memory allocator, but from what I see, it will not easily support this.
Of course if it were easy for obmalloc to release unused arenas, it would already do so <0.3 wink>.
Gross Hack:
I've been playing with the fact that the "collect" function in the gc module already gets called occasionally. Whenever it gets called for a level 2 collection, I've hacked it to call a cleanup function in obmalloc.c. This function goes through the free pool list, reorganizes it to decrease memory fragmentation
Unsure what this means, because an object in CPython can never be relocated. If I view an obmalloc arena as an alternating sequence of blocks (a contiguous region of allocated objects) and gaps (a contiguous region of available bytes), then if I can't rearrange the blocks (and I can't), I can't rearrange the gaps either -- the set of gaps is the complement of the set of blocks.

Maybe you just mean that you collapse adjacent free pools into a free pool of a larger size class, when possible? If so, that's a possible step on the way toward identifying unused arenas, but I wouldn't call it an instance of decreasing memory fragmentation. In apps with steady states, between steady-state transitions it's not a good idea to "artificially" collapse free pools into free pools of larger size, because the app is going to want to reuse pools of the specific sizes it frees, and obmalloc optimizes for that case.

If the real point of this (whatever it is <wink>) is to identify free arenas, I expect that could be done a lot easier by keeping a count of allocated pools in each arena; e.g., maybe at the start of the arena, or by augmenting the vector of arena base addresses.
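The per-arena counting idea suggested above might look something like this. This is a hypothetical sketch under the assumption of fixed-size pools per arena; none of these names (`arena_info`, `note_pool_alloc`, `note_pool_free`) come from obmalloc.c:

```c
/* Hypothetical sketch of per-arena bookkeeping for spotting unused
 * arenas.  Field and function names are invented, not obmalloc's. */
#include <assert.h>
#include <stddef.h>

#define POOLS_PER_ARENA 64   /* e.g. a 256KB arena carved into 4KB pools */

typedef struct arena_info {
    void   *base;            /* arena base address */
    size_t  nallocated;      /* pools currently carved out of this arena */
} arena_info;

/* Called whenever a pool is taken from an arena. */
static void note_pool_alloc(arena_info *a)
{
    assert(a->nallocated < POOLS_PER_ARENA);
    a->nallocated++;
}

/* Called whenever a pool becomes entirely free again.  Returns nonzero
 * when the whole arena is unused and could be returned to the system. */
static int note_pool_free(arena_info *a)
{
    assert(a->nallocated > 0);
    a->nallocated--;
    return a->nallocated == 0;
}
```

The appeal is that detecting a reclaimable arena becomes an O(1) check at pool-free time, instead of a periodic sweep that touches every free pool.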
and decides, based on metrics collected from the last run, whether it should free some memory. It currently works fine, except that it permits the arena vector to grow indefinitely, which is also bad for a long-running process. It is also bad because these cleanups are relatively slow, as they touch every free page that is currently allocated, so I'm trying to figure out a way to integrate them more cleanly into the allocator itself.
This also requires that nothing call the allocation functions while this is happening. I believe that this is reasonable, considering that it is getting called from the cyclical garbage collector, but I don't know enough about Python internals to figure that out.
In theory, the calling thread holds the GIL (global interpreter lock) whenever an obmalloc function is called. That's why the lock macros inside obmalloc expand to nothing (and not locking inside obmalloc is a significant speed win). But in some versions of reality, that isn't true.

The best available explanation is in new_arena()'s long internal comment block: because of historical confusions about what Python's memory API *is*, it's possible that extension modules outside the core are incorrectly calling the obmalloc free() when they should be calling the system free(), and doing so without holding the GIL. At the time obmalloc last got a rework, we did find some extensions that were in fact mixing PyObject_{New, NEW} with PyMem_{Del, DEL, Free, FREE}. obmalloc endures extreme pain now to try to ensure that still works, despite the lack of proper thread locking. As the end of that comment block says,

    * Read the above 50 times before changing anything in this
    * block.

Now all such insane uses have been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL now. I don't know whether it's possible to get away with that, though -- if some "important" extension module is still careless here, it will break in ways that are all of catastrophic, rare, and difficult to reproduce or analyze. If I could make time for this, I'd risk it (but for 2.5, not for 2.4.x), and proactively search for -- and repair -- external extension modules that may still be insane in this respect.
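The kind of defensive tolerance described above -- an allocator that can survive being handed a pointer it doesn't own -- can be sketched in miniature. This is purely illustrative (a crude bump allocator, not obmalloc's arena/pool design), but the routing check in `my_free` is analogous in spirit to obmalloc asking "does this address fall inside one of my arenas?" before deciding which deallocator applies:

```c
/* Toy sketch of an allocator that routes a free() of a foreign
 * pointer to the system allocator, in the spirit of the defensive
 * check described above.  All names here are invented. */
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE 4096
static char pool[POOL_SIZE];
static size_t pool_used = 0;

/* Bump-allocate from the pool, falling back to malloc when full. */
static void *my_alloc(size_t n)
{
    if (pool_used + n <= POOL_SIZE) {
        void *p = pool + pool_used;
        pool_used += n;
        return p;
    }
    return malloc(n);
}

/* Does this pointer live inside our pool?  Compared via uintptr_t,
 * since relational comparison of unrelated pointers is undefined. */
static int in_pool(void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)pool && a < (uintptr_t)pool + POOL_SIZE;
}

static void my_free(void *p)
{
    if (in_pool(p)) {
        /* pooled memory: nothing to do in this toy version */
        return;
    }
    free(p);   /* not ours: hand it to the system allocator */
}
```

The cost of this tolerance is exactly the pain Tim describes: the ownership check has to be safe to run even when another thread is mutating allocator state, which is much harder than the toy version suggests.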
Eventually, I hope to do some benchmarks and figure out if this is actually a reasonable strategy. However, I was hoping to get some feedback before I waste too much time on this.
It's only a waste if it ultimately fails <wink>.