[Tim. to Neil]
> Moving to bigger pools and bigger arenas are pretty much no-brainers for us, [...]
We're running tests, benchmarks, the Python programs we always run, Python programs that are important to us, staring at obmalloc stats ... and seeing nothing bad, nothing mysterious, only neutral-to-good results. So what's to think about? ;-) These are 64-bit boxes, with terabytes of virtual address space. Asking for a meg of that is just reserving less than a thousandth of a thousandth of that range of integers, not actual physical RAM.
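To put rough numbers on that (a back-of-envelope sketch; the 47-bit user address space figure is typical for x86-64 Linux and varies by platform):

```python
# Fraction of a 64-bit process's virtual address space consumed by
# reserving one 1 MiB arena.  128 TiB (47 bits) of user address space
# is typical for x86-64 Linux; the exact figure varies by platform.
ADDR_SPACE = 1 << 47          # 128 TiB of virtual address space
ARENA = 1 << 20               # one 1 MiB arena reservation
fraction = ARENA / ADDR_SPACE
print(f"{fraction:.1e}")      # prints 7.5e-09 - far below a millionth
```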
> Bigger pools sound ok,
They're not necessarily independent choices. Increase pool size without increasing arena size, and the number of pools per arena falls. At the extreme, if pools were made the same size as arenas, we'd need at least 32 arenas just to start Python, which uses at least one pool of _every_ possible size class before you hit the prompt. Note that, on 64-bit boxes, the number of possible size classes is falling from 64 (3.7) to 32 (3.8), due to a desire to make everything aligned to 16 bytes - which I've already seen account for some programs needing hundreds of MB more RAM.
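For concreteness, here's that arithmetic with obmalloc's stock constants (a sketch using the classic 4 KiB pool / 256 KiB arena defaults, which are what my PR changes):

```python
# obmalloc small-object arithmetic, using the stock constants
# (4 KiB pools, 256 KiB arenas); the PR under discussion raises these.
SMALL_REQUEST_THRESHOLD = 512          # largest request obmalloc handles
POOL_SIZE = 4 * 1024
ARENA_SIZE = 256 * 1024

classes_37 = SMALL_REQUEST_THRESHOLD // 8    # 8-byte alignment in 3.7
classes_38 = SMALL_REQUEST_THRESHOLD // 16   # 16-byte alignment in 3.8
pools_per_arena = ARENA_SIZE // POOL_SIZE

print(classes_37, classes_38, pools_per_arena)   # 64 32 64

# If a pool grew to the size of an arena, each arena could serve only
# one size class, so touching every class at startup would need one
# arena per class: 32 arenas on 3.8.
```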
> but bigger arenas will make Python less likely to return memory to the system.
I know that gets repeated a lot, and I usually play along - but why do people believe that? Where's the evidence?
At the start, obmalloc never returned arenas to the system. The vast majority of users were fine with that. A relative few weren't. Evan Jones wrote all the (considerable!) code to change that, and I massaged it and checked it in - not because there was "scientific proof" that it was more beneficial than harmful (it certainly added new expenses!) overall, but because it seemed like a right thing to do, _anticipating_ that the issue would become more important in coming years.
I'm still glad it was done, but no tests were checked in to _quantify_ its presumed benefits - or even to verify that it ever returned arenas to the system. Best I can tell, nobody actually has any informed idea how well it does. Evan stared at programs that were important to him, and fiddled things until he was "happy enough".
Not everyone was. About five years ago, Kristján Valur Jónsson opened a report suggesting a very different heuristic to try to free arenas. The "memcrunch.py" in his patch is the only time I've ever seen anyone write code trying to measure whether obmalloc's arena-freeing is effective.
I can verify that if you increase the number of objects in his script by a factor of 100, my PR _never_ returns an arena to the system. But it's clear as mud to me whether his heuristic would either (with the 100x smaller number of objects in the original script, the PR branch does recycle arenas).
So that's the only objective evidence I have :-)
I've looked at obmalloc stats in other programs at various stages, and saw nothing concerning. memcrunch.py appears to model object lifetimes as coming from a uniform distribution, but in real life they appear to be strongly multi-modal (with high peaks at the "way less than an eye blink" and "effectively immortal" ends).
> We should evaluate what problem we are trying to solve here,
I'm not trying to solve a problem. This is a "right thing to do" thing, anticipating that slinging a massive number of objects on massive machines will become ever more important, and that changing 20-year-old constants will allow obmalloc to spend more time in its fastest paths instead of its slowest. We haven't been especially pro-active about giant machines, and are suffering from it:
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters...

"""
Update: Here is a “fun” Python performance bug that I ran into the other day: deleting a set of 1 billion strings takes >12 hours. Obviously, this particular instance can be fixed, but this [is] exactly the sort of thing that I would never have done a few years ago. A billion strings seemed like a lot back then, but now we regularly discuss multiple Terabytes of input data as “not a big deal”. This may not apply for your settings, but it does for mine.
"""
That was _probably_ due to obmalloc's move-one-at-a-time way of keeping its usable arenas list sorted, which sat undiagnosed for over a year.
Fixing the underlying cause put giant machines on my radar, and getting rid of obmalloc's pool size limit was the next obvious thing that would help them (although not in the same universe as cutting quadratic time to linear).
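A toy model (not obmalloc's actual code) of why move-one-slot-at-a-time goes quadratic: keep arena records sorted by free-pool count, bubble a changed record back into place one swap at a time, and count the swaps:

```python
# Toy model: arenas kept sorted by free-pool count; after a count
# changes, the record is bubbled one slot at a time into position.
def bubble_right(counts, i):
    """Bubble counts[i] rightward into place; return number of swaps."""
    swaps = 0
    while i + 1 < len(counts) and counts[i] > counts[i + 1]:
        counts[i], counts[i + 1] = counts[i + 1], counts[i]
        i += 1
        swaps += 1
    return swaps

def total_swaps(n):
    """Repeatedly bump the first arena's count so it must cross the
    whole list - the kind of pattern mass deallocation can produce."""
    counts = list(range(n))
    swaps = 0
    for _ in range(n):
        counts[0] += n
        swaps += bubble_right(counts, 0)
    return swaps

print(total_swaps(100), total_swaps(200))   # 9900 39800: 2x arenas, 4x work
```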
> instead of staring at micro-benchmark numbers on an idle system.
My only interest in those is that they're not slowing down, because that's important too. The aim here is much more to make life better for programs slinging millions - even billions - of objects. obmalloc's internal overheads are frugal, but not free. For example, it has to allocate at least 56 bytes of separate bookkeeping info for each arena. Nobody cares when they have 100 arenas, but when there are a million arenas (which I've seen), that adds up.
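Rough arithmetic on that point (using the 56-byte figure above; the true size of the per-arena struct depends on build details):

```python
# Per-arena bookkeeping cost at scale.  56 bytes is the figure cited
# above; the actual arena-record size varies by build.
per_arena_bytes = 56
arenas = 1_000_000
overhead_mib = per_arena_bytes * arenas / 2**20
print(f"{overhead_mib:.1f} MiB")   # prints 53.4 MiB - just for arena records
```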
> Micro-benchmarks don't tell you what happens on a loaded system with many processes, lots of I/O happening.
While running a loaded system with many processes and lots of I/O doesn't tell you what happens in micro-benchmarks ;-) A difference is that performance of micro-benchmarks can be more-than-less reliably measured, while nobody can do better than guess about how changes affect a loaded system. Changing address space reservations from less than trivial to still less than trivial just isn't a _plausible_ source of disaster, at least not to my eyes.
> If the problem is the cost of mmap() and munmap() calls, then the solution more or less exists at the system level: jemalloc and other allocators use madvise() with MADV_FREE (see here: https://lwn.net/Articles/593564/).
> A possible design is a free list of arenas on which you use MADV_FREE to let the system know the memory *can* be reclaimed. When the free list overflows, call munmap() on extraneous arenas.
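For reference, a policy-only sketch of that free-list design (the madvise()/munmap() calls are injected stand-ins here, not real syscalls, and all names are mine, not obmalloc's):

```python
# Policy sketch of the proposed arena free list: released arenas are
# marked MADV_FREE (reclaimable but still mapped); when the list
# overflows its capacity, the oldest extras are truly unmapped.
class ArenaFreeList:
    def __init__(self, capacity, madv_free, munmap):
        self.capacity = capacity
        self.cached = []             # arenas mapped, pages reclaimable
        self.madv_free = madv_free   # stand-in for madvise(MADV_FREE)
        self.munmap = munmap         # stand-in for munmap()

    def release(self, arena):
        self.madv_free(arena)        # OS may now reclaim the physical pages
        self.cached.append(arena)
        while len(self.cached) > self.capacity:
            self.munmap(self.cached.pop(0))   # overflow: unmap the oldest

    def acquire(self):
        # Reuse a cached arena if any; the mapping stays valid even if
        # the OS reclaimed its pages in the meantime.
        return self.cached.pop() if self.cached else None

freed, unmapped = [], []
fl = ArenaFreeList(2, freed.append, unmapped.append)
for a in ("A", "B", "C"):
    fl.release(a)
print(freed, unmapped)   # all three MADV_FREE'd; only "A" unmapped
```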
People can certainly pursue that if they like. I'm not interested in adding more complication that helps only one of obmalloc's slowest paths on only one platform. Increasing pool and arena sizes targets all the slower paths on all platforms.
The dead obvious, dead simple, way to reduce mmap() expense is to call it less often, which just requires changing a compile-time constant - which will also call VirtualAlloc() equally less often on Windows.
The "real" problem you seem to care about is returning arenas to the system. I think more work should be done on that, but it's not the goal of _this_ PR. I'm not ignoring it, but so far I've seen no reason to be worried (not in my PR - but how the much larger arenas Neil generally uses fare in this respect is unknown to me).