On Sat, 15 Jun 2019 19:56:58 -0500 Tim Peters firstname.lastname@example.org wrote:
At the start, obmalloc never returned arenas to the system. The vast majority of users were fine with that. A relative few weren't. Evan Jones wrote all the (considerable!) code to change that, and I massaged it and checked it in - not because there was "scientific proof" that it was more beneficial than harmful (it certainly added new expenses!) overall, but because it seemed like a right thing to do, _anticipating_ that the issue would become more important in coming years.
I'm still glad it was done, but no tests were checked in to _quantify_ its presumed benefits - or even to verify that it ever returned arenas to the system. Best I can tell, nobody actually has any informed idea how well it does. Evan stared at programs that were important to him, and fiddled things until he was "happy enough".
We moved from malloc() to mmap() for allocating arenas because of user requests to release memory more deterministically: https://bugs.python.org/issue11849
And given the number of people who use Python for long-running processes nowadays, I'm sure that they would notice (and be annoyed) if Python did not release memory after memory consumption spikes.
I've looked at obmalloc stats in other programs at various stages, and saw nothing concerning. memchunk.py appears to model object lifetimes as coming from a uniform distribution, but in real life they appear to be strongly multi-modal (with high peaks at the "way less than an eye blink" and "effectively immortal" ends).
I agree they will certainly be multi-modal, with the number of modes, their respective weights and their temporal distance widely dependent on use cases.
(the fact that they're multi-modal is the reason why generational GC is useful, btw)
We haven't been especially pro-active about giant machines, and are suffering from it.
So you're definitely trying to solve a problem, right?
Fixing the underlying cause put giant machines on my radar, and getting rid of obmalloc's pool size limit was the next obvious thing that would help them (although not in the same universe as cutting quadratic time to linear).
"Not in the same universe", indeed. So the question becomes: does the improvement from increasing the pool and arena sizes have a negative outcome on *other* use cases?
Not everyone has giant machines. Actually, a frequent usage model is to have many small VMs or containers on a medium-size machine.
For example, it has to allocate at least 56 bytes of separate bookkeeping info for each arena. Nobody cares when they have 100 arenas, but when there are a million arenas (which I've seen), that adds up.
In relative terms, assuming that arenas are 50% full on average (probably a pessimistic assumption?), that overhead is 0.08% of the arena memory used. What is the point of worrying about that?
If the problem is the cost of mmap() and munmap() calls, then the solution more or less exists at the system level: jemalloc and other allocators use madvise() with MADV_FREE (see here: https://lwn.net/Articles/593564/).
A possible design is a free list of arenas on which you use MADV_FREE to let the system know the memory *can* be reclaimed. When the free list overflows, call munmap() on extraneous arenas.
People can certainly pursue that if they like. I'm not interested in adding more complication that helps only one of obmalloc's slowest paths on only one platform.
MADV_FREE is available on multiple platforms (at least Linux, macOS, FreeBSD). Windows seems to offer similar facilities: https://devblogs.microsoft.com/oldnewthing/20170113-00/?p=95185
The dead obvious, dead simple, way to reduce mmap() expense is to call it less often, which just requires changing a compile-time constant - which also means calling VirtualAlloc() correspondingly less often on Windows.
That's assuming the dominating term in mmap() cost is O(1) rather than O(size). That's not a given. The system call cost is certainly O(1), but the cost of reserving and mapping HW pages, and zeroing them out, is most certainly O(# pages).