[Python-Dev] Changing pymalloc behaviour for long running processes

Tim Peters tim.peters at gmail.com
Tue Oct 19 22:53:07 CEST 2004

[Evan Jones]
> There is absolutely nothing I can do about that, however. On platforms
> that matter to me (Mac OS X, Linux) some number of large malloc()
> allocations are done via mmap(), and can be immediately released when
> free() is called. Hence, large blocks are reclaimable. I have no
> knowledge about the implementation of malloc() on Windows. Anyone care
> to enlighten me?

Not me, I'm too short on time.  Memory pragmatics on Windows varies
both across Windows flavors and MS C runtime releases, so it's not a
simple topic.  In practice, at least the NT+ flavors of Windows, under
MS VC 6.0 and 7.1 + service packs, appear to do a reasonable job of
releasing VM reservations when free() gives a large block back.  I
wouldn't worry about older Windows flavors anymore.  The native Win32
API has many functions that could be used for fine control.

> ...
> I am not moving around Python objects, I'm just dealing with free pools
> and arenas in obmalloc.c at the moment.


> There are two separate things I am doing:
> 1. Scan through the free pool list, and count the number of free pools
> in each arena. If an arena is completely unused, I free it. If there is
> even one pool in use, the arena cannot be freed.


> 2. Sorting the free pool list so that "nearly full" arenas are used
> before "nearly empty" arenas. Right now, when a pool is free, it is
> pushed on the list. When one is needed, it is popped off. This leads to
> an LRU allocation of memory.

It's stack-like:  it reuses the pool most recently emptied, because
the expectation is that the most recently emptied pool is the most
likely of all empty pools to be highest in the memory hierarchy.  I
really don't know what LRU (or MRU) might mean in this context (it's
not like we're evicting something from a cache).

> What I am doing is removing all the free pools from the list, and putting them
> back on so that arenas that have more free pools are used later, while arenas
> with fewer free pools are used first.

That sounds reasonable.

> In my crude tests, the second detail increases the number of completely
> free arenas. However, I suspect that differentiating between free
> arenas and used arenas, like is already done for pools, would be a good
> idea.



> Absolutely: I am not touching that. I'm working from the assumption
> that pymalloc has been well tested and well tuned and is appropriate
> for Python workloads. I'm just trying to make it free memory
> occasionally.

Harder than it looked, eh <wink>?

>> If the real point of this (whatever it is <wink>) is to identify free
>> arenas, I expect that could be done a lot easier by keeping a count of
>> allocated pools in each arena ...

> You are correct, and this is something I would like to play with. This
> is, of course, a tradeoff between overhead on each allocation and
> deallocation,

It shouldn't be.  Pool transitions among the "used", "full" and
"empty" states don't occur on each alloc and dealloc.  Note that
PyObject_Free and PyObject_Malloc are both coded with the most
frequent paths earliest in the function, and pool transitions don't
occur until after a few return statements have passed.  It's unusual
not to get out via one of the "early returns"; the *bulk* of the code
in each function (including pool transitions) isn't executed on most
calls; in most calls, the affected pool both enters and leaves in the
"used" state.

> and one big occasional overhead caused by the "cleanup" process.

Or it may be small overhead, if all it's trying to do is free() empty
arenas.  Indeed, if arenas "grow states" too, *arena* transitions
should be so rare that perhaps they could afford to do extra
processing right then to decide whether to free() an arena that just
transitioned to its notion of an empty state.


> Let me just make sure I am clear on this: Some extensions use native
> threads,

By extension module I mean a module coded in C; and yes, any extension
module that uses threads is probably using native threads.

> is that why this is a problem?

No, threads aren't the problem, in the sense that an alcoholic's
problem isn't really alcohol, it's drinking <0.7 wink>.  The problem
is incorrect usage of the Python C API, and the most dangerous problem
there is that old code may be calling PyMem_{Free, FREE, Del, DEL}
while not holding the GIL.  "Everyone always knew" that PyMem_{Free,
FREE, Del, DEL} was just an irritating way to spell "free()", so some
old code didn't worry about the GIL when calling it.  Such code is
fatally broken, but we're still trying to support it (or rather we
*were*, when obmalloc was new; now it's still "supported" just in the
sense that the excruciating support code still exists).

The other twist is that we couldn't map PyMem_{Free, FREE, Del, DEL}
to the system free() directly (which would have solved the problem
just above), because *other* broken old code called PyMem_{Free, FREE,
Del, DEL} to release an object obtained via PyObject_New().  We're
still supporting that too, but again just in the sense that the
convolutions *to* support it still exist.

If we changed PyMem_{Free, FREE, Del, DEL} to map to the system
free(), all would be golden (except for broken old code mixing
PyObject_ with PyMem_ calls).  If any such broken code still exists,
that remapping would lead to dramatic failures, easy to reproduce; and
old code broken in the other, infinitely more subtle way (calling
PyMem_{Free, FREE, Del, DEL} when not holding the GIL) would continue
to work fine.

> Because as far as I am aware, the Python interpreter itself is not threaded.

Unsure what that means to you.  Any number of threads can be running
Python code in a single process, although the GIL serializes their
execution *while* they're executing Python code.  When a thread ends
up in C code, it's up to the C code to decide whether to release the
GIL and so allow other threads to run at the same time.  If it does,
that thread must reacquire the GIL before making another Python C API
call (with very few exceptions, related to Python C API thread
initialization and teardown functions).

> So how does the cyclical garbage collector work?

The same as every other part of Python's C implementation, *except*
for this crazy exception in obmalloc:  it assumes the GIL is held, and
that no other thread can make a Python C API call until the GIL is
released.  Note that this doesn't necessarily mean that cyclic gc can
assume that no other thread can run Python code until cyclic gc is
done.  Because gc may trigger destructors that in turn execute Python
code (__del__ methods or weakref callbacks), it's all but certain
other threads *can* run at such times (invoking Python code ends up in
the interpreter main loop, which releases the GIL periodically to
allow other threads to run).

obmalloc doesn't have *that* problem, though -- nothing obmalloc does
can cause Python code to get executed, so obmalloc can assume that the
thread calling into it holds the GIL for as long as obmalloc wants. 
Except, again, for the crazy PyMem_{Free, FREE, Del, DEL} exception.

> Doesn't it require that there is no execution going on?

As above.

>> Now all such insane uses have been officially deprecated, so you could
>> be bold and just assume obmalloc is always entered by a thread holding
>> the GIL now.

> I would rather not break this property of obmalloc.

I would -- it's backward compatibility hacks for insane code, which
may not even exist anymore, and you'll find that it puts severe
constraints on what you can do.

> However, this leads to a big problem: I'm not sure it is possible to have an
> occasional cleanup task be lockless and co-operate nicely with other threads,
> since by definition it needs to go and mess with all the arenas. One of
> the reason that obmalloc *doesn't* have this problem is because it
> never releases memory.

Yes, but that's backwards:  obmalloc never releases memory in part
*because* of this thread problem.  Indeed, when new_arena() has to
grow the vector of arena base addresses, it doesn't realloc(), it
makes a copy into a new memory area, and deliberately lets the old
vector *leak*.  That's solely because some broken PyMem_{Free, FREE,
Del, DEL} call may be simultaneously trying to access the vector, and
without locking it's plain impossible to know whether or when that
occurs.  You'll have an equally impossible time trying to change the
content of the arena base vector in virtually any way -- heck, we've
got 40 lines of comments now just trying to explain what it took to
support appending new values safely (and that's the only kind of
mutation done on that vector now).

Change PyMem_{Free, FREE, Del, DEL} to stop resolving to PyObject_
functions, and all that pain can go away -- obmalloc could then do
anything it wanted to do without any thread worries.

>> It's only a waste if it ultimately fails <wink>.

> It is also a waste if the core Python developers decide it is a bad
> idea, and don't want to accept patches! :)

Sad to say, it's more likely that making time to review patches will
be the bottleneck, and in this area careful review is essential.  It's
great that you can make some time for this now -- be optimistic!
