
On Tue, 19 Oct 2004 12:02:14 +0200 (CEST), Evan Jones <ejones@uwaterloo.ca> wrote:
Subject: [Python-Dev] Changing pymalloc behaviour for long running processes
[ snip ]
The short version of the problem is that obmalloc.c never frees memory. This is a great strategy if the application runs for a short time and then quits, or if it has fairly constant memory usage. However, applications with very dynamic memory needs that run for a long time do not perform well, because Python hangs on to the peak amount of memory required, even if that memory is only needed for a tiny fraction of the run time. With my application, I have a Python process that occupies 1 GB of RAM for ~20 hours, even though it only uses that 1 GB for about 5 minutes. This is a problem that needs to be addressed, as it negatively impacts the performance of Python when manipulating very large data sets. In fact, I found a mailing list post where the poster was looking for a workaround for this issue, but I can't find it now.
Some posts to various lists [1] have stated that this is not a real problem because virtual memory takes care of it. That is fair if you are talking about a couple of megabytes. In my case, I'm talking about ~700 MB of wasted RAM, which is a problem. First, this is wasting space which could be used for disk cache, which would improve the performance of my system. Second, when the system decides to swap out the pages that haven't been used for a while, they are dirty and must be written to swap. If Python ever wants to use them again, they will be brought back in from swap. This is much worse than informing the system that the pages can be discarded and allocating them again later. In fact, the other native object types (ints, lists) seem to realize that holding on to a huge amount of memory indefinitely is a bad strategy, because they explicitly limit the size of their free lists. So why is this not a good idea for other types?
Does anyone else see this as a problem?
This is such a big problem for us that we had to rewrite some of our daemons to fork request handlers so that the memory would be freed. That's the only way we've found to deal with it, and it seems that's the preferred Python way of doing things: using processes, IPC, fork, etc. instead of threads.

In order to be able to release memory, the interpreter has to allocate memory in chunks bigger than the minimum that can be returned to the OS; e.g., on Linux that'd be 256 bytes (IIRC), so that libc's malloc would use mmap to allocate that chunk. Otherwise, if the memory was obtained with brk, then in virtually all OSes and malloc implementations it won't be returned to the OS even if the interpreter frees the memory.

For example, consider the following code in the interactive interpreter:

for i in range(10000000): pass

That run will create a lot of little integer objects and the virtual memory size of the interpreter will quickly grow to 155 MB and then drop to 117 MB. The 117 MB left are all those little integer objects that are not in use any more, which the interpreter would reuse as needed. When the system needs memory, it will page out the pages where these objects were allocated to swap.

In our application, paging to swap is extremely bad because sometimes we're running the OS booted from the net without swap. The daemon has to loop over lists of 20 to 40 thousand items at a time; it quickly grows to 60 MB on the first run and then continues to grow from there. When something else needs memory, it tries to swap and then crashes.

In the example above, the difference between 155 MB and 117 MB is 38 MB, which I assume is the size of the list object returned by range(), which contains the references to the integers. The list goes away when the interpreter finishes running the loop, and because it was already known how big it was going to be, it was allocated as a big chunk using mmap (my speculation). As a result, that memory was given back to the OS and the virtual memory size of the interpreter went down from 155 MB to 117 MB.

Regards,

--
Luis P Caamano
Atlanta, GA USA

PS
I rarely post to python-dev, this is probably the first time, so please let me take this opportunity to thank all the Python developers for all your efforts. Such a great language, and a great tool. My respect and admiration to all of you.
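A minimal sketch of how the observation above can be reproduced, assuming a Linux system (it reads VmSize from /proc/self/status) and the Python 2.x-era interpreter behaviour being discussed; the exact numbers will differ by machine, libc, and Python version.

    def vm_size_kb():
        # Linux-specific: parse the interpreter's virtual size out of /proc.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
        return 0

    before = vm_size_kb()

    # The loop from the post: creates (and discards) millions of small int objects.
    for i in range(10000000):
        pass

    after = vm_size_kb()
    print("VmSize before: %d kB, after: %d kB, retained: %d kB"
          % (before, after, after - before))

The "retained" figure is the memory the interpreter keeps for reuse even though nothing references those integers any more.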

On Oct 19, 2004, at 8:32, Luis P Caamano wrote:
This is such a big problem for us that we had to rewrite some of our daemons to fork request handlers so that the memory would be freed. That's the only way we've found to deal with it, and it seems that's the preferred Python way of doing things: using processes, IPC, fork, etc. instead of threads.
Phew! I'm glad I'm not the only one who is frustrated by this. I am willing to put the time in to fix this issue, as long as the Python community doesn't think this is a dumb idea.
In order to be able to release memory, the interpreter has to allocate memory in chunks bigger than the minimum that can be returned to the OS; e.g., on Linux that'd be 256 bytes (IIRC), so that libc's malloc would use mmap to allocate that chunk. Otherwise, if the memory was obtained with brk, then in virtually all OSes and malloc implementations it won't be returned to the OS even if the interpreter frees the memory.
Absolutely correct. Luckily, this is not a problem for Python, as its current behaviour is this:

a) If the allocation is > 256 bytes, call the system malloc.
b) If the allocation is <= 256 bytes, use Python's own allocator, which grabs memory in 256 kB chunks and never releases it.

I want to change the behaviour of (b). If I call free() on these 256 kB chunks, they *do* in fact get released to the operating system.
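To make the contrast concrete, here is an illustrative experiment (not from the thread): it assumes a Linux system, reads VmSize from /proc/self/status, and relies on the behaviour described above, where pymalloc holds on to its 256 kB chunks while the system malloc typically mmap()s a very large block and returns it promptly on free.

    def vm_size_kb():
        # Linux-specific helper: current virtual size of this process, in kB.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
        return 0

    def phase(label, make):
        start = vm_size_kb()
        obj = make()               # allocate
        peak = vm_size_kb()
        del obj                    # release from Python's point of view
        end = vm_size_kb()
        print("%s: grew %d kB, gave back %d kB" % (label, peak - start, peak - end))

    # Millions of tiny allocations are served by pymalloc's small-object chunks.
    phase("many small objects (pymalloc)", lambda: [object() for _ in range(2000000)])

    # One 200 MB block goes straight to the system malloc (usually via mmap).
    phase("one large block (system malloc)", lambda: bytearray(200 * 1024 * 1024))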
That run will create a lot of little integer objects and the virtual memory size of the interpreter will quickly grow to 155 MB and then drop to 117 MB. The 117 MB left are all those little integer objects that are not in use any more, which the interpreter would reuse as needed.
Oops, you are right: I was incorrect in my previous post. Python does not limit the number of free integers, etc. that it keeps around. This is part of the problem: each of Python's native types maintains its own free list. If you create a huge list of ints, deallocate it, and then create a huge list of floats, Python will allocate more memory even though it doesn't need to, because the memory parked on the int free list cannot be reused for floats. If the pymalloc allocator were used instead, it would recycle the memory that was used by the integers. At the moment, even if I change the behaviour of pymalloc, it still won't solve this problem, since each type keeps its own free list. This is also on my TODO list: benchmark the performance of using the pymalloc allocator for these types.
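The following sketch (again Linux-specific, reading VmSize from /proc/self/status, and describing the 2.x-era per-type free lists under discussion) illustrates the point: memory parked on the int free list after the first burst is not available to the float allocator, so the second burst grows the process further.

    def vm_size_kb():
        # Linux-specific helper: current virtual size of this process, in kB.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
        return 0

    base = vm_size_kb()

    ints = [i + 1000 for i in range(3000000)]    # burst of int objects
    del ints                                     # ints are parked on the int free list
    held_after_ints = vm_size_kb() - base

    floats = [i * 1.0 for i in range(3000000)]   # burst of float objects: new memory
    del floats                                   # floats go on the float free list
    held_after_floats = vm_size_kb() - base

    print("held after int burst:   %d kB" % held_after_ints)
    print("held after float burst: %d kB" % held_after_floats)   # typically larger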
In our application, paging to swap is extremely bad because sometimes we're running the OS booted from the net without swap. The daemon has to loop over lists of 20 to 40 thousand items at a time; it quickly grows to 60 MB on the first run and then continues to grow from there. When something else needs memory, it tries to swap and then crashes.
This is another good point: some systems do not have swap. The changes I am proposing would not make Python less memory hungry, but they would mean that memory which is no longer needed gets returned to the operating system, instead of being held by the process until it exits.
In the example above, the difference between 155 MB and 117 MB is 38 MB, which I assume is the size of the list object returned by range(), which contains the references to the integers. The list goes away when the interpreter finishes running the loop, and because it was already known how big it was going to be, it was allocated as a big chunk using mmap (my speculation). As a result, that memory was given back to the OS and the virtual memory size of the interpreter went down from 155 MB to 117 MB.
That is correct. If you look at the implementation for lists, it keeps at most 80 list objects on a free list, and immediately frees the memory for the containing array. Again, this seems sub-optimal to me: in some cases, if a program uses a lot of lists, 80 may not be enough; for others, 80 may be too much. It seems to me that a more dynamic allocation policy could be more efficient.

Evan Jones

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso
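For reference, a minimal sketch in Python of the bounded free-list policy described above. CPython's real version lives in the C list implementation; the class, names, and the fixed limit of 80 here are purely illustrative.

    class BoundedFreeList(object):
        """Keep at most MAX_FREE released objects around for quick reuse."""

        MAX_FREE = 80                      # the fixed limit discussed above

        def __init__(self, factory):
            self._factory = factory        # how to build a fresh object
            self._free = []                # cached, currently-unused objects

        def acquire(self):
            # Prefer a cached object; fall back to a fresh allocation.
            if self._free:
                return self._free.pop()
            return self._factory()

        def release(self, obj):
            # Cache up to MAX_FREE objects; anything beyond that is simply
            # dropped so its memory can genuinely be reclaimed.
            if len(self._free) < self.MAX_FREE:
                self._free.append(obj)


    # Example use: recycling 4 kB scratch buffers instead of list objects.
    pool = BoundedFreeList(lambda: bytearray(4096))
    buf = pool.acquire()
    pool.release(buf)

The trade-off Evan is pointing at is visible here: MAX_FREE is a guess, and no single guess fits both list-heavy and list-light programs.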

Evan Jones wrote:
That is correct. If you look at the implementation for lists, it keeps at most 80 list objects on a free list, and immediately frees the memory for the containing array. Again, this seems sub-optimal to me: in some cases, if a program uses a lot of lists, 80 may not be enough; for others, 80 may be too much. It seems to me that a more dynamic allocation policy could be more efficient.
I knew this discussion sounded familiar. . .

http://mail.python.org/pipermail/python-dev/2004-June/045403.html (and assorted replies)

I'm not saying I *like* the unbounded free lists. . . but there's a reason they're still like that (i.e. getting the memory usage down tends to take some of Python's speed with it - and there isn't exactly a lot of that to be spared!). Still, fresh eyes on the problem may see something new :)

Cheers, Nick.

On Oct 19, 2004, at 9:44, Nick Coghlan wrote:
I knew this discussion sounded familiar. . .
http://mail.python.org/pipermail/python-dev/2004-June/045403.html (and assorted replies)
Great! Thank you for the reference, this discussion is useful. Why is it that it is so hard to find stuff in mailing list archives? I spent about 5 hours poking around Google and manually browsing the archives, and didn't see that thread.
I'm not saying I *like* the unbounded lists. . . but there's a reason they're still like that (i.e. getting the memory usage down tends to take some of Python's speed with it - and there isn't exactly a lot of that to be spared!).
Of course, and this has to be carefully considered and carefully benchmarked as well. I don't think that a bounded free list approach is the answer, as it can be shown that any arbitrary, fixed limit is bad in some situations. Basically, I'm talking about including some cache management in Python. My idea is to add some instrumentation that monitors the size of the free lists over time. If the free lists are much larger than what is being used, go and free some of them. This would allow Python's memory usage to adjust dynamically to the workload, and should limit the overhead. There are two challenges to this idea:

1. How can we run such an occasional task? It should be called regularly, whether Python is actively executing or not, just like a garbage collector thread would be. I don't think making this collection part of the normal call to allocate/free objects is a good idea, because that would add expensive overhead to what needs to be a fast path for Python.

2. Can this be done with minimal overhead? The instrumentation needs to have near-zero overhead, and the collection task needs to have small overhead.

I haven't thought this out completely, but it seems to me that something better should be possible.

Evan Jones

--
Evan Jones: http://evanjones.ca/
"Computers are useless. They can only give answers" - Pablo Picasso
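One possible reading of this proposal, sketched in Python. The class, names, and the slack factor are illustrative, not an existing CPython interface: instrument how much of the cache was actually reused since the last maintenance pass, then trim the cache toward that observed demand.

    class AdaptiveFreeList(object):
        """Free list whose size tracks recent demand instead of a fixed cap."""

        def __init__(self, factory, slack=2):
            self._factory = factory
            self._free = []                # cached, currently-unused objects
            self._reused = 0               # instrumentation: reuses since last trim
            self._slack = slack            # keep a small multiple of observed demand

        def acquire(self):
            if self._free:
                self._reused += 1
                return self._free.pop()
            return self._factory()

        def release(self, obj):
            self._free.append(obj)         # unbounded between maintenance passes

        def maintenance(self):
            # The "occasional task": shrink the cache to a multiple of recent
            # reuse, so memory follows the workload rather than its peak.
            target = self._reused * self._slack
            if len(self._free) > target:
                del self._free[target:]    # the excess becomes ordinary garbage
            self._reused = 0

The instrumentation in acquire() is a single increment, which is the kind of near-zero overhead the second challenge asks for; the open question is who calls maintenance(), and how often.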

On Tue, Oct 19, 2004 at 10:25:27AM -0400, Evan Jones wrote:
1. How can we run such an occasional task? It should be called regularly, whether Python is actively executing or not, just like a garbage collector thread would be. I don't think making this collection part of the normal call to allocate/free objects is a good idea, because that would add expensive overhead to what needs to be a fast path for Python.
Would attaching it to a less frequent level 1 garbage collection call work? (or even a level 2 if 1 is called too often)
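A sketch of that suggestion in later terms, using gc.callbacks, a hook that did not exist at the time of this thread (it appeared in Python 3.3). trim_caches() is a hypothetical stand-in for the free-list maintenance discussed above.

    import gc

    def trim_caches():
        # Hypothetical hook: this is where the free-list trimming would happen.
        pass

    def _after_gc(phase, info):
        # Piggyback on the rarer, older-generation collections only.
        if phase == "stop" and info.get("generation", 0) >= 1:
            trim_caches()

    gc.callbacks.append(_after_gc)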
Participants (4): Evan Jones, Gregory P. Smith, Luis P Caamano, Nick Coghlan