[Evan Jones]
I know that this has been discussed a bit in the past, but I was hoping that some Python gurus could shed some light on this issue, and maybe let me know if there are any plans for solving this problem. I know a hack that might work, but there must be a better way to solve this problem.
I agree there are several issues here that are important for significant classes of apps, but have no plans to do anything about them (I simply don't have time for it). I'm not aware of anyone else intending to work on these areas either, so it's all yours <wink>.
The short version of the problem is that obmalloc.c never frees memory.
True. That's one major problem for some apps. Another major problem for some apps is due to unbounded internal free lists outside of obmalloc. Another is that the platform OS+libc may not shrink VM at times even when memory is returned to the system free(). ...
In fact, the other native object types (ints, lists) seem to realize that holding on to a huge amount of memory indefinitely is a bad strategy, because they explicitly limit the size of their free lists.
Most native object types don't have free lists (there are *many* native object types); they use pymalloc or the system malloc; type-specific free lists are generally found attached only to "high use" native types, where speed and/or memory-per-object was thought important enough to bother with a custom free list. Not all custom free lists are implemented in the same basic way. The most important oddballs are the free lists for ints and floats, which are unbounded and immortal. ...
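The "unbounded and immortal" free-list pattern described above can be sketched roughly like this. This is a simplified stand-in, not the actual intobject.c code; the names (`obj`, `obj_new`, `obj_del`, `free_list`) are invented for illustration. Freed objects are pushed onto a linked list and never returned to the system, no matter how long the list grows:

```c
/* Simplified sketch of an unbounded, immortal free list, in the
 * spirit of CPython's int free list.  Names and structure are
 * illustrative only, not the actual intobject.c code. */
#include <stdio.h>
#include <stdlib.h>

typedef struct obj {
    struct obj *next;   /* links free objects; unused while "live" */
    long value;
} obj;

static obj *free_list = NULL;   /* grows without bound, never freed */

static obj *obj_new(long v)
{
    obj *o;
    if (free_list != NULL) {            /* reuse a freed object */
        o = free_list;
        free_list = o->next;
    }
    else {                              /* otherwise fall back to malloc */
        o = malloc(sizeof(obj));
        if (o == NULL)
            return NULL;
    }
    o->value = v;
    return o;
}

static void obj_del(obj *o)
{
    /* "Freeing" just pushes onto the free list: the memory is never
     * handed back to the allocator, which is exactly the behavior
     * being complained about for ints and floats. */
    o->next = free_list;
    free_list = o;
}
```

A deleted object's block is recycled by the very next allocation of the same type, which is why this is such a speed win for high-churn types -- and why a one-time spike of a few million ints pins that memory for the life of the process.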
Proposal: - Python's memory allocator should occasionally free memory
That's a worthy goal.
if the memory usage has been relatively constant, and has been well below the amount of memory allocated.
That's a possible implementation strategy. I think you'll find it helpful to distinguish goals from implementations.
This will incur additional overhead to free the memory, and additional overhead to reallocate it if the memory is needed again quickly. However, it will make Python co-operate nicely with other processes,
This is so complicated in real life -- depends on the OS, depends on details of the system malloc's implementation, "what works" on one platform may not work on another, etc.
and a clever implementation should be able to reduce the overhead.
Problem: - I do not completely understand Python's memory allocator, but from what I see, it will not easily support this.
Of course if it were easy for obmalloc to release unused arenas, it would already do so <0.3 wink>.
Gross Hack:
I've been playing with the fact that the "collect" function in the gc module already gets called occasionally. Whenever it gets called for a level 2 collection, I've hacked it to call a cleanup function in obmalloc.c. This function goes through the free pool list, reorganizes it to decrease memory fragmentation
Unsure what this means, because an object in CPython can never be relocated. If I view an obmalloc arena as an alternating sequence of blocks (a contiguous region of allocated objects) and gaps (a contiguous region of available bytes), then if I can't rearrange the blocks (and I can't), I can't rearrange the gaps either -- the set of gaps is the complement of the set of blocks.

Maybe you just mean that you collapse adjacent free pools into a free pool of a larger size class, when possible? If so, that's a possible step on the way toward identifying unused arenas, but I wouldn't call it an instance of decreasing memory fragmentation. In apps with steady states, between steady-state transitions it's not a good idea to "artificially" collapse free pools into free pools of larger size, because the app is going to want to reuse pools of the specific sizes it frees, and obmalloc optimizes for that case.

If the real point of this (whatever it is <wink>) is to identify free arenas, I expect that could be done a lot easier by keeping a count of allocated pools in each arena; e.g., maybe at the start of the arena, or by augmenting the vector of arena base addresses.
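The per-arena counting idea suggested above might look something like this. This is a hypothetical sketch under the assumption of fixed-size pools per arena; none of these names (`arena_info`, `note_pool_alloc`, `note_pool_free`) come from obmalloc.c:

```c
/* Hypothetical sketch of per-arena bookkeeping for spotting unused
 * arenas.  Field and function names are invented, not obmalloc's. */
#include <assert.h>
#include <stddef.h>

#define POOLS_PER_ARENA 64   /* e.g. a 256KB arena carved into 4KB pools */

typedef struct arena_info {
    void   *base;            /* arena base address */
    size_t  nallocated;      /* pools currently carved out of this arena */
} arena_info;

/* Called whenever a pool is taken from an arena. */
static void note_pool_alloc(arena_info *a)
{
    assert(a->nallocated < POOLS_PER_ARENA);
    a->nallocated++;
}

/* Called whenever a pool becomes entirely free again.  Returns nonzero
 * when the whole arena is unused and could be returned to the system. */
static int note_pool_free(arena_info *a)
{
    assert(a->nallocated > 0);
    a->nallocated--;
    return a->nallocated == 0;
}
```

The appeal is that detecting a reclaimable arena becomes an O(1) check at pool-free time, instead of a periodic sweep that touches every free pool.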
and decides, based on metrics collected from the last run, whether it should free some memory. It currently works fine, except that it permits the arena vector to grow indefinitely, which is also bad for a long-running process. It is also bad because these cleanups are relatively slow, as they touch every free page that is currently allocated, so I'm trying to figure out a way to integrate them more cleanly into the allocator itself.
This also requires that nothing call the allocation functions while this is happening. I believe that this is reasonable, considering that it is getting called from the cyclical garbage collector, but I don't know enough about Python internals to figure that out.
In theory, the calling thread holds the GIL (global interpreter lock) whenever an obmalloc function is called. That's why the lock macros inside obmalloc expand to nothing (and not locking inside obmalloc is a significant speed win). But in some versions of reality, that isn't true.

The best available explanation is in new_arena()'s long internal comment block: because of historical confusions about what Python's memory API *is*, it's possible that extension modules outside the core are incorrectly calling the obmalloc free() when they should be calling the system free(), and doing so without holding the GIL. At the time obmalloc last got a rework, we did find some extensions that were in fact mixing PyObject_{New, NEW} with PyMem_{Del, DEL, Free, FREE}. obmalloc endures extreme pain now to try to ensure that still works, despite the lack of proper thread locking. As the end of that comment block says,

    * Read the above 50 times before changing anything in this
    * block.

Now all such insane uses have been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL now. I don't know whether it's possible to get away with that, though -- if some "important" extension module is still careless here, it will break in ways that are all of catastrophic, rare, and difficult to reproduce or analyze. If I could make time for this, I'd risk it (but for 2.5, not for 2.4.x), and proactively search for -- and repair -- external extension modules that may still be insane in this respect.
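The kind of defensive tolerance described above -- an allocator that can survive being handed a pointer it doesn't own -- can be sketched in miniature. This is purely illustrative (a crude bump allocator, not obmalloc's arena/pool design), but the routing check in `my_free` is analogous in spirit to obmalloc asking "does this address fall inside one of my arenas?" before deciding which deallocator applies:

```c
/* Toy sketch of an allocator that routes a free() of a foreign
 * pointer to the system allocator, in the spirit of the defensive
 * check described above.  All names here are invented. */
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE 4096
static char pool[POOL_SIZE];
static size_t pool_used = 0;

/* Bump-allocate from the pool, falling back to malloc when full. */
static void *my_alloc(size_t n)
{
    if (pool_used + n <= POOL_SIZE) {
        void *p = pool + pool_used;
        pool_used += n;
        return p;
    }
    return malloc(n);
}

/* Does this pointer live inside our pool?  Compared via uintptr_t,
 * since relational comparison of unrelated pointers is undefined. */
static int in_pool(void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)pool && a < (uintptr_t)pool + POOL_SIZE;
}

static void my_free(void *p)
{
    if (in_pool(p)) {
        /* pooled memory: nothing to do in this toy version */
        return;
    }
    free(p);   /* not ours: hand it to the system allocator */
}
```

The cost of this tolerance is exactly the pain Tim describes: the ownership check has to be safe to run even when another thread is mutating allocator state, which is much harder than the toy version suggests.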
Eventually, I hope to do some benchmarks and figure out if this is actually a reasonable strategy. However, I was hoping to get some feedback before I waste too much time on this.
It's only a waste if it ultimately fails <wink>.