We moved from malloc() to mmap() for allocating arenas because of user requests to release memory more deterministically:
Which was a good change! As was using VirtualAlloc() on Windows. None of that is being disputed. The change under discussion isn't about mmap - mmap only incidentally gets sucked in here because it's part of _the_ very slowest of obmalloc's paths. Again, I'm aiming at _all_ of obmalloc's slower paths, on all platforms, at once. mmap isn't my focus.
That doesn't preclude anyone who cares lots about mmap from adding more complication to cater specifically to it.
And given the number of people who use Python for long-running processes nowadays, I'm sure that they would notice (and be annoyed) if Python did not release memory after memory consumption spikes.
The PR changes nothing about the "release arenas" heuristic or implementation. There's just unquantified speculation that boosting arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less trivial 1 MiB, may be disastrous. The only evidence for that so far is repetition of the speculation ;-)
... We haven't been especially pro-active about giant machines, and are suffering from it:
So you're definitely trying to solve a problem, right?
By my definition of "a problem", no. I have no quantified goals or any criteria to say "and now it's solved". I have open-ended concerns about how Python will fare on giant machines slinging billions of objects, and want to make things better _before_ users are provoked to complain. Which they'll do anyway, of course. I'm not trying to change human nature ;-)
So the question becomes: does the improvement increasing the pool and arena size have a negative outcome on *other* use cases?
Which is why Neil & I have been running benchmarks, tests, Python programs we run all the time, Python programs we care about ... and actively soliciting feedback from people actually trying the code on programs _they_ care about.
Not everyone has giant machines. Actually a frequent usage model is to have many small VMs or containers on a medium-size machine.
Something I've never done, so am wholly unqualified to judge. I don't even know what "many", "small", or "medium-size" might mean in this context. And I don't have a giant machine either, but spent most of my professional career in the "supercomputer" business so at least understand how those kinds of people think ;-)
For example, it has to allocate at least 56 bytes of separate bookkeeping info for each arena. Nobody cares when they have 100 arenas, but when there are a million arenas (which I've seen), that adds up.
In relative terms, assuming that arenas are 50% full on average (probably a pessimistic assumption?), that overhead is 0.08% per arena memory used. What is the point of worrying about that?
You're only looking at one cost. Those bytes aren't just address reservations, they consume actual physical RAM. The bookkeeping info is periodically read, and mutated, over time. In aggregate, that's more than enough bytes to blow away an entire L3 cache. The less of that stuff needs to be read and written (i.e., the fewer arenas there are), the less pressure that puts on the faster layers (caches) of the memory hierarchy.
That bookkeeping info is also immortal: all the bookkeeping arena_object structs live in a single contiguously allocated vector. It may move over time (via realloc()), but can never shrink, only grow. Asking the platform malloc/realloc for a 50 MB chunk sucks on the face of it :-(
So while the 50 MB is first-order trivial when a million arenas are in use, if the program enters a new phase releasing almost all of it, leaving (say) only 10 arenas still active, the 50 MB is still there, effectively wasting 200 arenas' worth of "invisible" (to _debugmallocstats()) space forever.
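To put rough numbers on that, here is a back-of-the-envelope sketch. It only uses figures already mentioned (56 bytes per arena_object, a million arenas, 256 KiB arenas); the function name is made up for illustration, and the "4x fewer arenas" line just assumes the same address space carved into 1 MiB arenas instead.

```python
# Back-of-the-envelope only: the figures come from this thread, and
# bookkeeping_bytes() is a hypothetical helper, not anything in obmalloc.
ARENA_OBJECT_BYTES = 56        # "at least 56 bytes" of bookkeeping per arena
SMALL_ARENA = 256 * 1024       # current arena size, 256 KiB
LARGE_ARENA = 1024 * 1024      # proposed arena size on 64-bit boxes, 1 MiB


def bookkeeping_bytes(n_arenas):
    """Size of the immortal arena_object vector for n_arenas arenas."""
    return n_arenas * ARENA_OBJECT_BYTES


total = bookkeeping_bytes(1_000_000)
print("bookkeeping bytes:", total)
print("in units of 256 KiB arenas:", total / SMALL_ARENA)

# Assumption: the same address space split into 1 MiB arenas needs about
# a quarter as many arenas, hence about a quarter of the bookkeeping.
fewer = 1_000_000 * SMALL_ARENA // LARGE_ARENA
print("bookkeeping bytes with 1 MiB arenas:", bookkeeping_bytes(fewer))
```

That lands a bit over 200 small arenas' worth, which is where the "200 arenas" above comes from.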
About typical arena usage, I expect, but don't know, that 50% is quite pessimistic. It's a failure of project management (starting with me) that _every_ step in the "free arenas" evolution was driven by a user complaining about their program, and that nothing was ever checked in to even ensure their problem remained solved, let alone to help find out how effective arena-releasing is in an arbitrary program. We've been flying blind from the start, and remain blind.
That said, over the decades I've often looked at obmalloc stats, and have generally been pleasantly surprised at how much of allocated space is being put to good use. 80% seems more typical than 50% to me based on that.
It's easy enough to contrive programs tending toward only 16 bytes in use per arena, but nobody has ever reported anything like that. The aforementioned memcrunch.py program wasn't particularly contrived, but was added to a bug report to "prove" that obmalloc's arena releasing strategy was poor.
Here are some stats from running that under my PR, but using 200 times the initial number of objects as the original script:
n = 20000000  # number of things
At the end, with 1M arena and 16K pool:
3362 arenas * 1048576 bytes/arena  = 3,525,312,512
# bytes in allocated blocks        = 1,968,233,888
With 256K arena and 4K pool:
13375 arenas * 262144 bytes/arena  = 3,506,176,000
# bytes in allocated blocks        = 1,968,233,888
So even there over 50% of arena space was in allocated blocks at the end. Total arena space remained essentially the same either way.
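For anyone who wants to check the arithmetic, a small sketch that recomputes the fraction of arena space sitting in allocated blocks from the two sets of stats above. The helper name is mine; the inputs are copied from the output shown.

```python
# Recompute arena utilization from the memcrunch.py stats quoted above.
# utilization() is a hypothetical helper written for this message.
def utilization(n_arenas, arena_bytes, allocated_bytes):
    """Fraction of total arena space that is in allocated blocks."""
    return allocated_bytes / (n_arenas * arena_bytes)


ALLOCATED = 1_968_233_888  # bytes in allocated blocks, same in both runs

big = utilization(3362, 1048576, ALLOCATED)     # 1M arena, 16K pool
small = utilization(13375, 262144, ALLOCATED)   # 256K arena, 4K pool

print(f"1M arenas:   {big:.1%}")
print(f"256K arenas: {small:.1%}")
```

Both come out near 56%, within a fraction of a percentage point of each other.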
However, with smaller arenas the peak memory use was _higher_. It did manage to release over 100 arenas (about 1% of the total ever allocated). With the larger arenas, none were ever released.
Either way, 32,921,097 objects remained in use at the end. _If_ the program had gone on to create another mass of new objects, the big-arena version was better prepared for it: it had 8,342,661 free blocks of the right size class ready to reuse, but the small-arena version had 2,951,103.
That is, releasing address space isn't a pure win either - it only pays if it so happens that the space won't be needed soon again. If the space is so needed, releasing the space was a waste of time.
There is no "hard science" in poke-&-hope ;-)
..... The dead obvious, dead simple, way to reduce mmap() expense is to call it less often, which just requires changing a compile-time constant - which will also call VirtualAlloc() equally less often on Windows.
That's assuming the dominating term in mmap() cost is O(1) rather than O(size). That's not a given. The system call cost is certainly O(1), but the cost of reserving and mapping HW pages, and zeroing them out is most certainly O(# pages).
Good point! I grossly overstated it. But since I wasn't focused on mmap to begin with, it doesn't change any of my motivations for wanting to boost pool and arena sizes.
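If someone wants to poke at whether the per-call or the per-page term dominates on their own box, here is a rough sketch. It uses Python's mmap module for anonymous mappings, which is not what obmalloc does (that calls the C-level mmap()/VirtualAlloc() directly), so interpreter overhead is mixed into the timings and the results are suggestive at best. The function name and the sizes are arbitrary choices of mine.

```python
import mmap
import time


def map_and_touch(total_bytes, arena_bytes):
    """Map total_bytes of anonymous memory in arena_bytes chunks and write
    one byte to every page so the OS has to back it with real memory.

    Returns (number of mmap calls, elapsed seconds). Hypothetical helper.
    """
    n_calls = total_bytes // arena_bytes
    maps = []
    start = time.perf_counter()
    for _ in range(n_calls):
        m = mmap.mmap(-1, arena_bytes)  # -1 requests an anonymous mapping
        for offset in range(0, arena_bytes, mmap.PAGESIZE):
            m[offset] = 1  # touch the page
        maps.append(m)
    elapsed = time.perf_counter() - start
    for m in maps:
        m.close()
    return n_calls, elapsed


TOTAL = 8 * 1024 * 1024  # arbitrary: 8 MiB in total, either way
for arena in (256 * 1024, 1024 * 1024):
    calls, secs = map_and_touch(TOTAL, arena)
    print(f"arena={arena:>8}  mmap calls={calls:>3}  seconds={secs:.6f}")
```

The number of pages touched is the same for both arena sizes; only the number of mmap calls differs (by 4x). If the two timings are close, the O(# pages) work is what's dominating on that platform.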
BTW, anyone keen to complicate the mmap management should first take this recent change into account::
That appears to have killed off _the_ most overwhelmingly common cause of obmalloc counter-productively releasing an arena only to create a new one again milliseconds later.
My branch, and Neil's, both contain that change, which makes it much harder to compare our branches' obmalloc arena stats with 3.7. It turns out that a whole lot of "released arenas" under 3.7 (and still will be under 3.8) were due to that worse-than-useless arena thrashing.