Changing pymalloc behaviour for long running processes

I know that this has been discussed a bit in the past, but I was hoping that some Python gurus could shed some light on this issue, and maybe let me know if there are any plans for solving this problem. I know a hack that might work, but there must be a better way to solve this problem.

The short version of the problem is that obmalloc.c never frees memory. This is a great strategy if the application runs for a short time then quits, or if it has fairly constant memory usage. However, applications with very dynamic memory needs that run for a long time do not perform well, because Python hangs on to the peak amount of memory required, even if that memory is only required for a tiny fraction of the run time. With my application, I have a Python process which occupies 1 GB of RAM for ~20 hours, even though it only uses that 1 GB for about 5 minutes. This is a problem that needs to be addressed, as it negatively impacts the performance of Python when manipulating very large data sets. In fact, I found a mailing list post where the poster was looking for a workaround for this issue, but I can't find it now.

Some posts to various lists [1] have stated that this is not a real problem because virtual memory takes care of it. This is fair if you are talking about a couple megabytes. In my case, I'm talking about ~700 MB of wasted RAM, which is a problem. First, this is wasting space which could be used for disk cache, which would improve the performance of my system. Second, when the system decides to swap out the pages that haven't been used for a while, they are dirty and must be written to swap. If Python ever wants to use them again, they will be brought in from swap. This is much worse than informing the system that the pages can be discarded, and allocating them again later.

In fact, the other native object types (ints, lists) seem to realize that holding on to a huge amount of memory indefinitely is a bad strategy, because they explicitly limit the size of their free lists. So why is this not a good idea for other types? Does anyone else see this as a problem? Does anyone think this is not a problem?

Proposal:

- Python's memory allocator should occasionally free memory if the memory usage has been relatively constant, and has been well below the amount of memory allocated. This will incur additional overhead to free the memory, and additional overhead to reallocate it if the memory is needed again quickly. However, it will make Python co-operate nicely with other processes, and a clever implementation should be able to reduce the overhead.

Problem:

- I do not completely understand Python's memory allocator, but from what I see, it will not easily support this.

Gross Hack:

I've been playing with the fact that the "collect" function in the gc module already gets called occasionally. Whenever it gets called for a level 2 collection, I've hacked it to call a cleanup function in obmalloc.c. This function goes through the free pool list, reorganizes it to decrease memory fragmentation and decides based on metrics collected from the last run if it should free some memory. It currently works fine, except that it will permit the arena vector to grow indefinitely, which is also bad for a long running process. It is also bad because these cleanups are relatively slow as they touch every free page that is currently allocated, so I'm trying to figure out a way to integrate them more cleanly into the allocator itself. This also requires that nothing call the allocation functions while this is happening.
I believe that this is reasonable, considering that it is getting called from the cyclical garbage collector, but I don't know enough about Python internals to figure that out. Eventually, I hope to do some benchmarks and figure out if this is actually a reasonable strategy. However, I was hoping to get some feedback before I waste too much time on this.

Evan Jones

[1] http://groups.google.com/groups?selm=mailman.1053801468.4243.python-list%40python.org

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso
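For concreteness, here is a minimal standalone sketch of the kind of release heuristic the proposal describes: only hand arenas back when recent usage has stayed well below what the allocator is holding. Every name, structure, and threshold here is a placeholder invented for illustration; none of it is obmalloc code.

#include <stdio.h>
#include <stddef.h>

typedef struct {
    size_t arenas_allocated;   /* arenas currently held by the allocator */
    size_t arenas_in_use;      /* arenas with at least one live pool */
    size_t peak_in_use_recent; /* highest in-use count since the last cleanup */
} alloc_stats;

/* Decide how many arenas to hand back to the system: only release some if
 * recent usage stayed well below what we are holding on to. */
static size_t
arenas_to_release(const alloc_stats *s)
{
    size_t headroom = s->arenas_allocated - s->peak_in_use_recent;
    if (headroom < s->arenas_allocated / 4)
        return 0;          /* usage is close to capacity; keep everything */
    return headroom / 2;   /* release half the slack, keep the rest as a cushion */
}

int main(void)
{
    /* Roughly the situation described above: ~1 GB held, little of it used. */
    alloc_stats s = { 4096, 300, 350 };
    printf("would release %lu arenas\n", (unsigned long)arenas_to_release(&s));
    return 0;
}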

[Evan Jones]
I know that this has been discussed a bit in the past, but I was hoping that some Python gurus could shed some light on this issue, and maybe let me know if there are any plans for solving this problem. I know a hack that might work, but there must be a better way to solve this problem.
I agree there are several issues here that are important for significant classes of apps, but have no plans to do anything about them (I simply don't have time for it). I'm not aware of anyone else intending to work on these areas either, so it's all yours <wink>.
The short version of the problem is that obmalloc.c never frees memory.
True. That's one major problem for some apps. Another major problem for some apps is due to unbounded internal free lists outside of obmalloc. Another is that the platform OS+libc may not shrink VM at times even when memory is returned to the system free(). ...
In fact, the other native object types (ints, lists) seem to realize that holding on to a huge amount of memory indefinitely is a bad strategy, because they explicitly limit the size of their free lists.
Most native object types don't have free lists (there are *many* native object types); they use pymalloc or the system malloc; type-specific free lists are generally found attached only to "high use" native types, where speed and/or memory-per-object was thought important enough to bother with a custom free list. Not all custom free lists are implemented in the same basic way. The most important oddballs are the free lists for ints and floats, which are unbounded and immortal. ...
Proposal: - Python's memory allocator should occasionally free memory
That's a worthy goal.
if the memory usage has been relatively constant, and has been well below the amount of memory allocated.
That's a possible implementation strategy. I think you'll find it helpful to distinguish goals from implementations.
This will incur additional overhead to free the memory, and additional overhead to reallocate it if the memory is needed again quickly. However, it will make Python co-operate nicely with other processes,
This is so complicated in real life -- depends on the OS, depends on details of the system malloc's implementation, "what works" on one platform may not work on another, etc.
and a clever implementation should be able to reduce the overhead.
Problem: - I do not completely understand Python's memory allocator, but from what I see, it will not easily support this.
Of course if it were easy for obmalloc to release unused arenas, it would already do so <0.3 wink>.
Gross Hack:
I've been playing with the fact that the "collect" function in the gc module already gets called occasionally. Whenever it gets called for a level 2 collection, I've hacked it to call a cleanup function in obmalloc.c. This function goes through the free pool list, reorganizes it to decrease memory fragmentation
Unsure what this means, because an object in CPython can never be relocated. If I view an obmalloc arena as an alternating sequence of blocks (a contiguous region of allocated objects) and gaps (a contiguous region of available bytes), then if I can't rearrange the blocks (and I can't), I can't rearrange the gaps either -- the set of gaps is the complement of the set of blocks. Maybe you just mean that you collapse adjacent free pools into a free pool of a larger size class, when possible? If so, that's a possible step on the way toward identifying unused arenas, but I wouldn't call it an instance of decreasing memory fragmentation. In apps with steady states, between steady-state transitions it's not a good idea to "artificially" collapse free pools into free pools of larger size, because the app is going to want to reuse pools of the specific sizes it frees, and obmalloc optimizes for that case. If the real point of this (whatever it is <wink>) is to identify free arenas, I expect that could be done a lot easier by keeping a count of allocated pools in each arena; e.g., maybe at the start of the arena, or by augmenting the vector of arena base addresses.
and decides based on metrics collected from the last run if it should free some memory. It currently works fine, except that it will permit the arena vector to grow indefinitely, which is also bad for a long running process. It is also bad because these cleanups are relatively slow as they touch every free page that is currently allocated, so I'm trying to figure out a way to integrate them more cleanly into the allocator itself.
This also requires that nothing call the allocation functions while this is happening. I believe that this is reasonable, considering that it is getting called from the cyclical garbage collector, but I don't know enough about Python internals to figure that out.
In theory, the calling thread holds the GIL (global interpreter lock) whenever an obmalloc function is called. That's why the lock macros inside obmalloc expand to nothing (and not locking inside obmalloc is a significant speed win). But in some versions of reality, that isn't true. The best available explanation is in new_arena()'s long internal comment block: because of historical confusions about what Python's memory API *is*, it's possible that extension modules outside the core are incorrectly calling the obmalloc free() when they should be calling the system free(), and doing so without holding the GIL. At the time obmalloc last got a rework, we did find some extensions that were in fact mixing PyObject_{New, NEW} with PyMem_{Del, DEL, Free, FREE}. obmalloc endures extreme pain now to try to ensure that still works, despite the lack of proper thread locking. As the end of that comment block says,

 * Read the above 50 times before changing anything in this
 * block.

Now all such insane uses have been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL now. I don't know whether it's possible to get away with that, though -- if some "important" extension module is still careless here, it will break in ways that are all of catastrophic, rare, and difficult to reproduce or analyze. If I could make time for this, I'd risk it (but for 2.5, not for 2.4.x), and proactively search for-- and repair --external extension modules that may still be insane in this respect.
Eventually, I hope to do some benchmarks and figure out if this is actually a reasonable strategy. However, I was hoping to get some feedback before I waste too much time on this.
It's only a waste if it ultimately fails <wink>.

At 12:14 PM -0400 10/19/04, Tim Peters wrote:
[Evan Jones]
The short version of the problem is that obmalloc.c never frees memory.
True. That's one major problem for some apps. Another major problem for some apps is due to unbounded internal free lists outside of obmalloc. Another is that the platform OS+libc may not shrink VM at times even when memory is returned to the system free().
FWIW, at this point nearly all OSes have a means of allocating memory from the system that can then later be returned to the system. (malloc and free tend *not* to do this.) Even on Unix platforms you can play the "mmap a file with no filename" game to get returnable chunks. Unfortunately there's often a limit to the number of these chunks you can get from the OS, so it's not safe to unconditionally replace malloc and free. (The performance impact isn't worth it either.)

If someone's going to do this, I'd suggest the place to start is adding separate allocate and free API entry points for returnable chunks, and put in some criteria for getting memory from them (allocation size, particular spots that allocate, or whatever) and see where you go from there.

I'll point out that, from experience, this can be a non-trivial thing, and with a non-moving GC system you'll probably find that there are relatively few places where there's a win from it. It does, though, tend to flush out dangling pointers. (Whether this is good or bad is a separate issue, of course ;)

-- Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk

On Oct 19, 2004, at 12:14, Tim Peters wrote:
True. That's one major problem for some apps. Another major problem for some apps is due to unbounded internal free lists outside of obmalloc. Another is that the platform OS+libc may not shrink VM at times even when memory is returned to the system free().
There is absolutely nothing I can do about that, however. On platforms that matter to me (Mac OS X, Linux) some number of large malloc() allocations are done via mmap(), and can be immediately released when free() is called. Hence, large blocks are reclaimable. I have no knowledge about the implementation of malloc() on Windows. Anyone care to enlighten me?

Another approach is to not free the memory, but instead to inform the operating system that the pages are unused (on Unix, madvise(2) with MADV_DONTNEED or MADV_FREE). When this happens, the operating system *may* discard the pages, but the address range remains valid: if it is touched again in the future, the OS will allocate a new page. This would require some dramatic changes to Python's internals.
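As a concrete illustration of the madvise() approach, here is a standalone sketch (not CPython code; the available constants and any required feature macros vary by platform):

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 256 * 1024;    /* one arena-sized region, for the example */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... the region is used for a while, then becomes entirely free ... */

#ifdef MADV_FREE
    if (madvise(p, len, MADV_FREE) != 0)        /* BSD / Mac OS X spelling */
        perror("madvise");
#else
    if (madvise(p, len, MADV_DONTNEED) != 0)    /* Linux spelling */
        perror("madvise");
#endif

    /* The address range is still valid.  For an anonymous mapping, touching
     * it later just faults in fresh zero-filled pages instead of dragging
     * stale data back in from swap. */
    munmap(p, len);
    return 0;
}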
if the memory usage has been relatively constant, and has been well below the amount of memory allocated.
That's a possible implementation strategy. I think you'll find it helpful to distinguish goals from implementations.
You are correct: This is an implementation detail. However, it is a relatively important one, as I do not want to change Python's aggressive memory recycling behaviour.
Maybe you just mean that you collapse adjacent free pools into a free pool of a larger size class, when possible? If so, that's a possible step on the way toward identifying unused arenas, but I wouldn't call it an instance of decreasing memory fragmentation.
I am not moving around Python objects, I'm just dealing with free pools and arenas in obmalloc.c at the moment. There are two separate things I am doing:

1. Scan through the free pool list, and count the number of free pools in each arena. If an arena is completely unused, I free it. If there is even one pool in use, the arena cannot be freed.

2. Sorting the free pool list so that "nearly full" arenas are used before "nearly empty" arenas. Right now, when a pool is free, it is pushed on the list. When one is needed, it is popped off. This leads to an LRU allocation of memory. What I am doing is removing all the free pools from the list, and putting them back on so that arenas that have more free pools are used later, while arenas with less free pools are used first.

In my crude tests, the second detail increases the number of completely free arenas. However, I suspect that differentiating between free arenas and used arenas, like is already done for pools, would be a good idea.
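A toy model of the first step, using invented stand-in structures (nothing here matches obmalloc's real pool or arena layout; it only shows the count-then-release logic, including the need to unlink a freed arena's pools from the free list first):

#include <stdlib.h>

#define POOLS_PER_ARENA 64          /* e.g. a 256 KB arena of 4 KB pools */

typedef struct arena {
    void *base;                     /* the arena's memory block */
    int   nfreepools;               /* filled in by the scan below */
    int   live;                     /* nonzero while owned by the allocator */
} arena_t;

typedef struct pool {
    struct pool *next;              /* free-pool list linkage */
    arena_t     *arena;             /* arena this pool belongs to */
} pool_t;

/* Returns the (possibly shortened) free-pool list. */
static pool_t *
release_empty_arenas(pool_t *freepools, arena_t *arenas, int narenas)
{
    int i;
    pool_t *p, **pp;

    /* Count the free pools in each arena. */
    for (i = 0; i < narenas; i++)
        arenas[i].nfreepools = 0;
    for (p = freepools; p != NULL; p = p->next)
        p->arena->nfreepools++;

    /* Unlink pools that belong to all-free arenas first: in the real
     * allocator the pool headers live inside the arena being freed. */
    pp = &freepools;
    while (*pp != NULL) {
        if ((*pp)->arena->nfreepools == POOLS_PER_ARENA)
            *pp = (*pp)->next;
        else
            pp = &(*pp)->next;
    }

    /* Now hand the empty arenas back to the system. */
    for (i = 0; i < narenas; i++) {
        if (arenas[i].live && arenas[i].nfreepools == POOLS_PER_ARENA) {
            free(arenas[i].base);
            arenas[i].live = 0;
        }
    }
    return freepools;
}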
In apps with steady states, between steady-state transitions it's not a good idea to "artificially" collapse free pools into free pools of larger size, because the app is going to want to reuse pools of the specific sizes it frees, and obmalloc optimizes for that case.
Absolutely: I am not touching that. I'm working from the assumption that pymalloc has been well tested and well tuned and is appropriate for Python workloads. I'm just trying to make it free memory occasionally.
If the real point of this (whatever it is <wink>) is to identify free arenas, I expect that could be done a lot easier by keeping a count of allocated pools in each arena; e.g., maybe at the start of the arena, or by augmenting the vector of arena base addresses.
You are correct, and this is something I would like to play with. This is, of course, a tradeoff between overhead on each allocation and deallocation, and one big occasional overhead caused by the "cleanup" process. I'm going to try and take a look at this tonight, if I get some real work done this afternoon.
But in some versions of reality, that isn't true. The best available explanation is in new_arena()'s long internal comment block: because of historical confusions about what Python's memory API *is*, it's possible that extension modules outside the core are incorrectly calling the obmalloc free() when they should be calling the system free(), and doing so without holding the GIL.
Let me just make sure I am clear on this: Some extensions use native threads, is that why this is a problem? Because as far as I am aware, the Python interpreter itself is not threaded. So how does the cyclical garbage collector work? Doesn't it require that there is no execution going on?
Now all such insane uses have been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL now.
I would rather not break this property of obmalloc. However, this leads to a big problem: I'm not sure it is possible to have an occasional cleanup task be lockless and co-operate nicely with other threads, since by definition it needs to go and mess with all the arenas. One of the reasons that obmalloc *doesn't* have this problem is because it never releases memory.
It's only a waste if it ultimately fails <wink>.
It is also a waste if the core Python developers decide it is a bad idea, and don't want to accept patches! :)

Thanks for your feedback,

Evan Jones

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso

Evan Jones wrote:
Another is that the platform OS+libc may not shrink VM at times even when memory is returned to the system free().
There is absolutely nothing I can do about that, however.
You could if you wanted. Don't use malloc/free, but use mmap/munmap, VirtualAlloc/VirtualFree, etc.
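A rough sketch of what arena-sized, genuinely returnable allocation could look like along those lines; error handling, alignment and fallbacks are omitted, and the 256 KB size is only used as an example:

#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

#define ARENA_SIZE (256 * 1024)

static void *arena_alloc(void)
{
#ifdef _WIN32
    return VirtualAlloc(NULL, ARENA_SIZE, MEM_RESERVE | MEM_COMMIT,
                        PAGE_READWRITE);
#else
    void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#endif
}

static void arena_free(void *p)
{
#ifdef _WIN32
    VirtualFree(p, 0, MEM_RELEASE);   /* size must be 0 with MEM_RELEASE */
#else
    munmap(p, ARENA_SIZE);            /* the pages really go back to the OS */
#endif
}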
Anyone care to enlighten me?
Microsoft ships the source of its malloc(3) implementation together with VC; you need to install the CRT source to see it.
Let me just make sure I am clear on this: Some extensions use native threads, is that why this is a problem? Because as far as I am aware, the Python interpreter itself is not threaded. So how does the cyclical garbage collector work? Doesn't it require that there is no execution going on?
The garbage collector holds the GIL. So while there could be other threads running, they must not manipulate any PyObject*. If they try to, they need to obtain the GIL first, which will make them block until the garbage collector is complete.
It is also a waste if the core Python developers decide it is a bad idea, and don't want to accept patches! :)
That will ultimately depend on the patches. The feature itself would be fine, as Tim explains. However, patches might be rejected because:
- they are incorrect,
- their correctness cannot easily be established,
- they change unrelated aspects of the interpreter,
- they have undesirable performance properties, or
- they have other problems I can't think of right now :-)

Regards,
Martin

[Evan Jones] ...
There is absolutely nothing I can do about that, however. On platforms that matter to me (Mac OS X, Linux) some number of large malloc() allocations are done via mmap(), and can be immediately released when free() is called. Hence, large blocks are reclaimable. I have no knowledge about the implementation of malloc() on Windows. Anyone care to enlighten me?
Not me, I'm too short on time. Memory pragmatics on Windows varies both across Windows flavors and MS C runtime releases, so it's not a simple topic. In practice, at least the NT+ flavors of Windows, under MS VC 6.0 and 7.1 + service packs, appear to do a reasonable job of releasing VM reservations when free() gives a large block back. I wouldn't worry about older Windows flavors anymore. The native Win32 API has many functions that could be used for fine control.
... I am not moving around Python objects, I'm just dealing with free pools and arenas in obmalloc.c at the moment.
Good.
There two separate things I am doing:
1. Scan through the free pool list, and count the number of free pools in each arena. If an arena is completely unused, I free it. If there is even one pool in use, the arena cannot be freed.
Yup.
2. Sorting the free pool list so that "nearly full" arenas are used before "nearly empty" arenas. Right now, when a pool is free, it is pushed on the list. When one is needed, it is popped off. This leads to an LRU allocation of memory.
It's stack-like: it reuses the pool most recently emptied, because the expectation is that the most recently emptied pool is the most likely of all empty pools to be highest in the memory hierarchy. I really don't know what LRU (or MRU) might mean in this context (it's not like we're evicting something from a cache).
What I am doing is removing all the free pools from the list, and putting them back on so that arenas that have more free pools are used later, while arenas with less free pools are used first.
That sounds reasonable.
In my crude tests, the second detail increases the number of completely free arenas. However, I suspect that differentiating between free arenas and used arenas, like is already done for pools, would be a good idea.
Right. ...
Absolutely: I am not touching that. I'm working from the assumption that pymalloc has been well tested and well tuned and is appropriate for Python workloads. I'm just trying to make it free memory occasionally.
Harder than it looked, eh <wink>?
If the real point of this (whatever it is <wink>) is to identify free arenas, I expect that could be done a lot easier by keeping a count of allocated pools in each arena ...
You are correct, and this is something I would like to play with. This is, of course, a tradeoff between overhead on each allocation and deallocation,
It shouldn't be. Pool transitions among the "used", "full" and "empty" states don't occur on each alloc and dealloc. Note that PyObject_Free and PyObject_Malloc are both coded with the most frequent paths earliest in the function, and pool transitions don't occur until after a few return statements have passed. It's unusual not to get out via one of the "early returns"; the *bulk* of the code in each function (including pool transitions) isn't executed on most calls; in most calls, the affected pool both enters and leaves in the "used" state.
and one big occasional overhead caused by the "cleanup" process.
Or it may be small overhead, if all it's trying to do is free() empty arenas. Indeed, if arenas "grow states" too, *arena* transitions should be so rare that perhaps they could afford to do extra processing right then to decide whether to free() an arena that just transitioned to its notion of an empty state. ...
Let me just make sure I am clear on this: Some extensions use native threads,
By extension module I mean a module coded in C; and yes, any extension module that uses threads is probably using native threads.
is that why this is a problem?
No, threads aren't the problem, in the sense that an alcoholic's problem isn't really alcohol, it's drinking <0.7 wink>. The problem is incorrect usage of the Python C API, and the most dangerous problem there is that old code may be calling PyMem_{Free, FREE, Del, DEL} while not holding the GIL. "Everyone always knew" that PyMem_{Free, FREE, Del, DEL} was just an irritating way to spell "free()", so some old code didn't worry about the GIL when calling it. Such code is fatally broken, but we're still trying to support it (or rather we *were*, when obmalloc was new; now it's still "supported" just in the sense that the excruciating support code still exists).

The other twist is that we couldn't map PyMem_{Free, FREE, Del, DEL} to the system free() directly (which would have solved the problem just above), because *other* broken old code called PyMem_{Free, FREE, Del, DEL} to release an object obtained via PyObject_New(). We're still supporting that too, but again just in the sense that the convolutions *to* support it still exist.

If we changed PyMem_{Free, FREE, Del, DEL} to map to the system free(), all would be golden (except for broken old code mixing PyObject_ with PyMem_ calls). If any such broken code still exists, that remapping would lead to dramatic failures, easy to reproduce; and old code broken in the other, infinitely more subtle way (calling PyMem_{Free, FREE, Del, DEL} when not holding the GIL) would continue to work fine.
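For illustration, the remapping Tim describes would look roughly like the following header fragment. This is not the actual pymem.h of the time (where the PyMem names were still routed into obmalloc for compatibility); it only shows the proposed end state:

#include <stdlib.h>

/* Raw-memory API: straight to the platform allocator, safe to call
 * without holding the GIL. */
#define PyMem_FREE(p)   free(p)
#define PyMem_DEL(p)    free(p)

/* Object-memory API: goes through obmalloc, so callers must hold the GIL. */
void *PyObject_Malloc(size_t n);
void  PyObject_Free(void *p);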
Because as far as I am aware, the Python interpreter itself is not threaded.
Unsure what that means to you. Any number of threads can be running Python code in a single process, although the GIL serializes their execution *while* they're executing Python code. When a thread ends up in C code, it's up to the C code to decide whether to release the GIL and so allow other threads to run at the same time. If it does, that thread must reacquire the GIL before making another Python C API call (with very few exceptions, related to Python C API thread initialization and teardown functions).
So how does the cyclical garbage collector work?
The same as every other part of Python's C implementation, *except* for this crazy exception in obmalloc: it assumes the GIL is held, and that no other thread can make a Python C API call until the GIL is released. Note that this doesn't necessarily mean that cyclic gc can assume that no other thread can run Python code until cyclic gc is done. Because gc may trigger destructors that in turn execute Python code (__del__ methods or weakref callbacks), it's all but certain other threads *can* run at such times (invoking Python code ends up in the interpreter main loop, which releases the GIL periodically to allow other threads to run). obmalloc doesn't have *that* problem, though -- nothing obmalloc does can cause Python code to get executed, so obmalloc can assume that the thread calling into it holds the GIL for as long as obmalloc wants. Except, again, for the crazy PyMem_{Free, FREE, Del, DEL} exception.
Doesn't it require that there is no execution going on?
As above.
Now all such insane uses have been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL now.
I would rather not break this property of obmalloc.
I would -- it's backward compatibility hacks for insane code, which may not even exist anymore, and you'll find that it puts severe constraints on what you can do.
However, this leads to a big problem: I'm not sure it is possible to have an occasional cleanup task be lockless and co-operate nicely with other threads, since by definition it needs to go and mess with all the arenas. One of the reasons that obmalloc *doesn't* have this problem is because it never releases memory.
Yes, but that's backwards: obmalloc never releases memory in part *because* of this thread problem. Indeed, when new_arena() has to grow the vector of arena base addresses, it doesn't realloc(), it makes a copy into a new memory area, and deliberately lets the old vector *leak*. That's solely because some broken PyMem_{Free, FREE, Del, DEL} call may be simultaneously trying to access the vector, and without locking it's plain impossible to know whether or when that occurs. You'll have an equally impossible time trying to change the content of the arena base vector in virtually any way -- heck, we've got 40 lines of comments now just trying to explain what it took to support appending new values safely (and that's the only kind of mutation done on that vector now). Change PyMem_{Free, FREE, Del, DEL} to stop resolving to PyObject_ functions, and all that pain can go away -- obmalloc could then do anything it wanted to do without any thread worries.
It's only a waste if it ultimately fails <wink>.
It is also a waste if the core Python developers decide it is a bad idea, and don't want to accept patches! :)
Sad to say, it's more likely that making time to review patches will be the bottleneck, and in this area careful review is essential. It's great that you can make some time for this now -- be optimistic!

First, let me thank you for this very detailed reply. It really helped me understand a lot more about what is going on inside the Python interpreter. On Oct 19, 2004, at 16:53, Tim Peters wrote:
It's stack-like: it reuses the pool most recently emptied, because the expectation is that the most recently emptied pool is the most likely of all empty pools to be highest in the memory hierarchy. I really don't know what LRU (or MRU) might mean in this context (it's not like we're evicting something from a cache).
Err... Right: MRU. It uses the most recently used free block. This is totally a cache: It's a cache of free memory pages.
Harder than it looked, eh <wink>?
Actually, much. I spent about 6 hours figuring out what was going on. At this point, I think I have enough of a handle on the situation that I might as well go about trying to improve it.
Or it may be small overhead, if all it's trying to do is free() empty arenas. Indeed, if arenas "grow states" too, *arena* transitions should be so rare that perhaps they could afford to do extra processing right then to decide whether to free() an arena that just transitioned to its notion of an empty state.
That is true. However, I don't think freeing arenas immediately is the best plan, as we don't really want to do that if the application is cyclical in its memory consumption (i.e. it creates a ton of objects, then releases them, then does it again). I still think that some sort of periodic collection is best, as it will help Python adjust to applications with a wide variety of memory profiles.
If we changed PyMem_{Free, FREE, Del, DEL} to map to the system free(), all would be golden (except for broken old code mixing PyObject_ with PyMem_ calls). If any such broken code still exists, that remapping would lead to dramatic failures, easy to reproduce; and old code broken in the other, infinitely more subtle way (calling PyMem_{Free, FREE, Del, DEL} when not holding the GIL) would continue to work fine.
Hmm... This seems like a logical approach to me. It certainly gives me a lot more freedom in reworking the memory allocator. Are there any objections to this idea?
Any number of threads can be running Python code in a single process, although the GIL serializes their execution *while* they're executing Python code. When a thread ends up in C code, it's up to the C code to decide whether to release the GIL and so allow other threads to run at the same time. If it does, that thread must reacquire the GIL before making another Python C API call (with very few exceptions, related to Python C API thread initialization and teardown functions).
Ah, now I understand! Creating a Python thread actually creates a native thread then, it's just that because of the GIL they run sequentially when executing Python code. This is an interesting approach! For some reason I was under the impression that the Python interpreter used user level threads to implement Python threads.
obmalloc doesn't have *that* problem, though -- nothing obmalloc does can cause Python code to get executed, so obmalloc can assume that the thread calling into it holds the GIL for as long as obmalloc wants. Except, again, for the crazy PyMem_{Free, FREE, Del, DEL} exception.
Terrific. This makes life much, much easier.
I would -- it's backward compatibility hacks for insane code, which may not even exist anymore, and you'll find that it puts severe constraints on what you can do.
Again, does anyone object to this point of view before I begin working from this assumption? This means that I can assume that only one thread will call code in obmalloc at a time. I can do the same thing that the current obmalloc implementation does: Add the macros for the locks, but have them resolve to nothing.

Thanks for the tutorial in the Python interpreter internals,

Evan Jones

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso

Tim Peters <tim.peters@gmail.com> writes:
In theory, the calling thread holds the GIL (global interpreter lock) whenever an obmalloc function is called. That's why the lock macros inside obmalloc expand to nothing (and not locking inside obmalloc is a significant speed win).
But in some versions of reality, that isn't true. The best available explanation is in new_arena()'s long internal comment block: because of historical confusions about what Python's memory API *is*, it's possible that extension modules outside the core are incorrectly calling the obmalloc free() when they should be calling the system free(), and doing so without holding the GIL. At the time obmalloc last got a rework, we did find some extensions that were in fact mixing PyObject_{New, NEW} with PyMem_{Del, DEL, Free, FREE}. obmalloc endures extreme pain now to try to ensure that still works, despite the lack of proper thread locking. As the end of that comment block says,
 * Read the above 50 times before changing anything in this
 * block.
Now all such insane uses have been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL now. I don't know whether it's possible to get away with that, though -- if some "important" extension module is still careless here, it will break in ways that are all of catastrophic, rare, and difficult to reproduce or analyze. If I could make time for this, I'd risk it (but for 2.5, not for 2.4.x), and proactively search for-- and repair --external extension modules that may still be insane in this respect.
Would it be possible to (in a debug build, presumably) do

    assert(I have the GIL);

in PyObject_Free?

Cheers,
mwh

--
If you don't have friends with whom to share links and conversation, you have social problems and you should confront them instead of joining a cultlike pseudo-community. -- "Quit Slashdot.org Today!" (http://www.cs.washington.edu/homes/klee/misc/slashdot.html#faq)

[Michael Hudson]
Would it be possible to (in a debug build, presumably) do
assert(I have the GIL);
in PyObject_Free?
There's no canned way to do this. I suppose it could call PyGILState_Ensure(), then assert that the return value is PyGILState_LOCKED. PyGILState_Ensure() has to do a lot of work to figure out whether its calling thread has the GIL, and needs access to pystate.c internals to get the right answer; I expect simpler approaches are doomed to fail in some cases. If we changed the PyMem_Free spellings now (for 2.5, not *now* now <wink>) to call the system free() instead in a release build, the thread craziness obmalloc is trying to protect against would just go away by magic. It would be good to do the assert() you suggest anyway, but for a different reason then (catching unsafe calls in debug builds, where all of the PyMem family ends up in obmalloc).
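A sketch of that debug-build check, assuming the PyGILState API (it is not code from obmalloc.c, just the shape of the idea):

#include <Python.h>
#include <assert.h>

static void
assert_gil_is_held(void)
{
#ifdef Py_DEBUG
    /* PyGILState_Ensure() reports PyGILState_LOCKED when the calling
     * thread already held the GIL, which is exactly what we want to
     * assert here.  It does real work, so keep this to debug builds. */
    PyGILState_STATE st = PyGILState_Ensure();
    assert(st == PyGILState_LOCKED);   /* caller must already hold the GIL */
    PyGILState_Release(st);
#endif
}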

Evan Jones wrote:
Some posts to various lists [1] have stated that this is not a real problem because virtual memory takes care of it. This is fair if you are talking about a couple megabytes. In my case, I'm talking about ~700 MB of wasted RAM, which is a problem.
This is not true. The RAM is not wasted. As you explain later, the pages will be swapped out to swap space, making the RAM available again for other tasks.
First, this is wasting space which could be used for disk cache, which would improve the performance of my system.
And indeed, this is what the operating system does for you: free the memory (by swapping it out), then use it for disk cache, thus improving the performance of your system.
Second, when the system decides to swap out the pages that haven't been used for a while, they are dirty and must be written to swap.
That is true.
If Python ever wants to use them again, they will be brought in from swap.
Yes. However, your assumption is that Python never wants to use them again, because the peak memory consumption is only local. In the design of memory management systems, there is the notion of a working set. Python will have a relatively constant working set over short periods of time, and current operating systems will manage to keep the working set in memory if the system has sufficient memory in the first place. As the working set grows or shrinks, pages get swapped in and out. As Tim explains, this is really hard to avoid.
This is much worse than informing the system that the pages can be discarded, and allocating them again later.
Unfortunately, as Tim explains, there is no way to reliably "inform" the system. free(3) may or may not be taken as such information. Regards, Martin

On Oct 19, 2004, at 14:00, Martin v. Löwis wrote:
Some posts to various lists [1] have stated that this is not a real problem because virtual memory takes care of it. This is fair if you are talking about a couple megabytes. In my case, I'm talking about ~700 MB of wasted RAM, which is a problem.
This is not true. The RAM is not wasted. As you explain later, the pages will be swapped out to swap space, making the RAM available again for other tasks.
Well, it isn't "wasted," but it is not optimal. If the pages were freed, the OS would use them for disk cache (or for other programs). However, because the operating system believes that these pages contain data, it must do one of the following two things:

a) Live with less disk cache (lower performance for disk I/O).
b) Pre-emptively swap the pages to disk, which is super slow. (On Linux, you can control how pre-emptive the kernel is by adjusting the "swappiness" sysctl.)

If it chooses to swap them out, the next time Python touches those pages, it will pause as the OS reads them back from disk. It can only help the system's performance if we give it hints about which pages are no longer in use.
If Python ever wants to use them again, they will be brought in from swap.
Yes. However, your assumption is that Python never wants to use them again, because the peak memory consumption is only local.
I am trying to correct the situation where Python is not going to use the pages for a long time. For most applications, Python's memory allocation policies are fine, but if you have a long running process that does nothing most of the time (say a low usage server) or does some huge pre-processing (my application), it keeps a ton of memory around for no reason. Right now, Python has very poor performance for my application because I have this massive memory peak, and very low average memory usage. Were I using Java, its usage would grow and shrink accordingly, thanks to the garbage collector releasing memory to the OS. Yes, with Python, we can't compact memory, but I think we can still do better than nothing.
As the working set grows or shrinks, pages get swapped in and out. As Tim explains, this is really hard to avoid.
If you actually tell the operating system that the pages are unused, it won't swap unless it actually needs to. Right now, a lot of pages are being swapped in and out that are actually *garbage*.
Unfortunately, as Tim explains, there is no way to reliably "inform" the system. free(3) may or may not be taken as such information.
As noted before, free() may not be sufficient, but mmap or madvise are.
The garbage collector holds the GIL. So while there could be other threads running, they must not manipulate any PyObject*. If they try to, they need to obtain the GIL first, which will make them block until the garbage collector is complete.
But as noted in a previous message, some extensions may not do this correctly, and try to do PyObject_Free anyway. Is that the problem that obmalloc tries to avoid? If the problem is only the possibility of PyObject_Free being called while another thread has the GIL, then I can probably avoid that issue.
That will ultimately depend on the patches. The feature itself would be fine, as Tim explains.
Great! That's basically what I am looking for.
However, patches might be rejected because:
[snip] Of course, I certainly hope that Python wouldn't accept garbage patches! :)

Thank you for your comments,

Evan Jones

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso

Here's an idea that may help implementation slightly, and will almost certainly increase the likelihood of any patch getting accepted: do the pool scanning and freeing only on specific call of e.g. gc.free_mem(). IIUC, you'll still need to change the allocation strategy slightly, but that's lower risk. Once your strategy has proven itself, we can start talking about ways to perform the freeing automatically.

--
Aahz (aahz@pythoncraft.com)           <*>           http://www.pythoncraft.com/

WiFi is the SCSI of the 21st Century -- there are fundamental technical reasons for sacrificing a goat. (with no apologies to John Woods)
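A sketch of what such an entry point might look like at the C level. _PyMalloc_Cleanup is hypothetical; it stands in for whatever obmalloc cleanup routine a patch would add, and the new function would still need an entry in gc's method table:

#include <Python.h>

extern size_t _PyMalloc_Cleanup(void);     /* hypothetical cleanup hook */

static PyObject *
gc_free_mem(PyObject *self, PyObject *args)
{
    size_t freed = _PyMalloc_Cleanup();    /* e.g. number of arenas released */
    return PyInt_FromLong((long)freed);    /* Python 2.x C API of the era */
}

/* Method-table entry, added to gcmodule.c's method list:
 *   {"free_mem", gc_free_mem, METH_NOARGS, "Release unused memory to the OS."},
 */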

On Oct 19, 2004, at 17:41, Aahz wrote:
Here's an idea that may help implementation slightly, and will almost certainly increase the likelihood of any patch getting accepted: do the pool scanning and freeing only on specific call of e.g. gc.free_mem(). IIUC, you'll still need to change the allocation strategy slightly, but that's lower risk.
This is a very reasonable suggestion, I'll definitely do this. It would also be easy enough to support both exposing this call to Python and doing it automatically based on some debugging macro.

Evan Jones

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso

On Tue, Oct 19, 2004 at 08:00:56PM +0200, "Martin v. Löwis" wrote:
Evan Jones wrote:
Some posts to various lists [1] have stated that this is not a real problem because virtual memory takes care of it. This is fair if you are talking about a couple megabytes. In my case, I'm talking about ~700 MB of wasted RAM, which is a problem.
This is not true. The RAM is not wasted. As you explain later, the pages will be swapped out to swap space, making the RAM available again for other tasks.
First, this is wasting space which could be used for disk cache, which would improve the performance of my system.
And indeed, this is what the operating system does for you: free the memory (by swapping it out), then use it for disk cache, thus improving the performance of your system.
In the long run on a system, the RAM may not be wasted once the OS happens to have swapped it out, but the address space is still used. You're still consuming ~700 MB of your OS's total address space with swapped garbage. The fact that ultimately a lot of it ends up on disk as swap is not nice to other processes wanting memory (and disk space, for OSes using a dynamic swap).

That said, here's a workaround for avoiding permanent huge memory consumption in known workloads: fork() before doing the part that consumes a ton of memory. Afterwards, return the results, post huge memory consumption, via pipe to the waiting parent process and exit the child, so the parent can continue on not consuming 700 MB.
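A standalone sketch of that workaround in C (a Python program would do the same with os.fork(), os.pipe(), and a serialized result); the computation and result value here are stand-ins:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    long result = 0;
    pid_t pid;

    if (pipe(fds) != 0) { perror("pipe"); return 1; }
    pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                       /* child: the 700 MB phase */
        close(fds[0]);
        result = 42;                      /* stands in for the real computation */
        write(fds[1], &result, sizeof result);
        _exit(0);                         /* the peak memory dies with the child */
    }

    close(fds[1]);                        /* parent: just read the small answer */
    read(fds[0], &result, sizeof result);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    printf("result from child: %ld\n", result);
    return 0;
}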

On Thu, 2004-10-28 at 04:09, Gregory P. Smith wrote:
That said, here's a workaround for avoiding permanent huge memory consumption in known workloads:
fork() before doing the part that consumes a ton of memory. Afterwards, return the results, post huge memory consumption, via pipe to the waiting parent process and exit the child, so the parent can continue on not consuming 700 MB.
Right: This certainly is an effective workaround, but only for very specific, memory consuming tasks. Python should be better than to inflict this kind of hack on programmers. We shouldn't have to worry about an implementation detail of the Python interpreter. I don't believe that Jython has this problem, but it has been a while since I looked at it, so I could be wrong. I'm still planning on working on this issue: I just need to find the time. Evan Jones