[Python-Dev] Re: radix tree arena map for obmalloc

17 Jun 2019

      [Antoine]
...
We moved from malloc() to mmap() for allocating arenas because of user
requests to release memory more deterministically:
https://bugs.python.org/issue11849
Which was a good change!  As was using VirtualAlloc() on Windows.
None of that is being disputed.  The change under discussion isn't
about mmap - mmap only incidentally gets sucked in here because it's
part of _the_ very slowest of obmalloc's paths.  Again, I'm aiming at
_all_ of obmalloc's slower paths, on all platforms, at once.  mmap
isn't my focus.

That doesn't preclude anyone who cares lots about mmap from adding
more complication to cater specifically to it.
...
And given the number of people who use Python for long-running
processes nowadays, I'm sure that they would notice (and be annoyed) if
Python did not release memory after memory consumption spikes.
The PR changes nothing about the "release arenas" heuristic or
implementation.  There's just unquantified speculation that boosting
arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less
trivial 1 MiB, may be disastrous.  The only evidence for that so far
is repetition of the speculation ;-)
...
...
...
We haven't been especially
pro-active about giant machines, and are suffering from it:
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters...
...
So you're definitely trying to solve a problem, right?
By my definition of "a problem", no.  I have no quantified goals or
any criteria to say "and now it's solved".  I have open-ended concerns
about how Python will fare on giant machines slinging billions of
objects, and want to make things better _before_ users are provoked to
complain.  Which they'll do anyway, of course.  I'm not trying to
change human nature ;-)
...
So the question becomes: does the improvement increasing the pool
and arena size have a negative outcome on *other* use cases?
Which is why Neil & I have been running benchmarks, tests, Python
programs we run all the time, Python programs we care about ... and
actively soliciting feedback from people actually trying the code on
programs _they_ care about.
...
Not everyone has giant machines.  Actually a frequent usage model is to
have many small VMs or containers on a medium-size machine.
Something I've never done, so am wholly unqualified to judge.  I don't
even know what "many", "small", or "medium-size" might mean in this
context.  And I don't have a giant machine either, but spent most of
my professional career in the "supercomputer" business so at least
understand how those kinds of people think ;-)
...
...
For example, it has to allocate at least 56 bytes of separate bookkeeping info
for each arena.  Nobody cares when they have 100 arenas, but when there
are a million arenas (which I've seen), that adds up.
...
In relative terms, assuming that arenas are 50% full on average
(probably a pessimistic assumption?), that overhead is 0.08% per arena
memory used.  What point is worrying about that?
You're only looking at one cost.  Those bytes aren't just address
reservations, they consume actual physical RAM.  The bookkeeping info
is periodically read, and mutated, over time.  In aggregate, that's
more than enough bytes to blow away an entire L3 cache.  The less of
that stuff needs to be read and written (i.e., the fewer arenas there
are), the less pressure that puts on the faster layers (caches) of the
memory hierarchy.

That bookkeeping info is also immortal:  all the bookkeeping
arena_object structs live in a single contiguously allocated vector.  It
may move over time (via realloc()), but can never shrink, only grow.
Asking the platform malloc/realloc for a 50 MB chunk sucks on the face
of it :-(

So while the 50 MB is first-order trivial when a million arenas are in
use, if the program enters a new phase releasing almost all of it,
leaving (say) only 10 arenas still active, the 50 MB is still there,
effectively wasting 200 arenas' worth of "invisible" (to
_debugmallocstats()) space forever.
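
For scale, the arithmetic behind those figures can be sketched as follows.
The 56-byte lower bound and the million-arena count are taken from the
text above; the 256 KiB arena size is the current one mentioned earlier,
and the variable names are made up for illustration.  The results land in
the same ballpark as the rounded "50 MB" and "200 arenas".

```python
# Back-of-the-envelope check of the arena bookkeeping overhead.
ARENA_OBJECT_BYTES = 56      # "at least 56 bytes" per arena, per the text
ARENA_SIZE = 256 * 1024      # current 256 KiB arena size
n_arenas = 1_000_000         # "a million arenas"

# Total bytes in the (never shrinking) vector of bookkeeping structs.
bookkeeping = ARENA_OBJECT_BYTES * n_arenas

# The same space expressed as a number of whole-arena equivalents.
arenas_worth = bookkeeping / ARENA_SIZE

print(f"{bookkeeping:,} bytes of bookkeeping structs")
print(f"about {arenas_worth:.0f} arenas' worth at {ARENA_SIZE:,} bytes/arena")
```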

About typical arena usage, I expect, but don't know, that 50% is quite
pessimistic.  It's a failure of project management (starting with me)
that _every_ step in the "free arenas" evolution was driven by a user
complaining about their program, and that nothing was ever checked in
to even ensure their problem remained solved, let alone to help find
out how effective arena-releasing is in an arbitrary program.  We've
been flying blind from the start, and remain blind.

That said, over the decades I've often looked at obmalloc stats, and
have generally been pleasantly surprised at how much of allocated
space is being put to good use.  80% seems more typical than 50% to me
based on that.

It's easy enough to contrive programs tending toward only 16 bytes in
use per arena, but nobody has ever reported anything like that.  The
aforementioned memcrunch.py program wasn't particularly contrived, but
was added to a bug report to "prove" that obmalloc's arena releasing
strategy was poor.

Here are some stats from running that under my PR, but with 200 times
as many initial objects as the original script:

n = 20000000 #number of things

At the end, with 1M arena and 16K pool:

3362 arenas * 1048576 bytes/arena  =        3,525,312,512
# bytes in allocated blocks        =        1,968,233,888

With 256K arena and 4K pool:

13375 arenas * 262144 bytes/arena  =        3,506,176,000
# bytes in allocated blocks        =        1,968,233,888

So even there over 50% of arena space was in allocated blocks at the
end.  Total arena space remained essentially the same either way.
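
The "over 50%" figure can be rechecked from the numbers reported above.
This is only a sketch: the dictionary layout and the utilization() helper
are made up for illustration, and the inputs are copied verbatim from the
two stat dumps.

```python
# (arena count, bytes per arena, bytes in allocated blocks) for each run.
stats = {
    "1M arena, 16K pool": (3362, 1048576, 1_968_233_888),
    "256K arena, 4K pool": (13375, 262144, 1_968_233_888),
}

def utilization(n_arenas, arena_bytes, allocated_bytes):
    """Fraction of total arena space that sits in allocated blocks."""
    return allocated_bytes / (n_arenas * arena_bytes)

for label, row in stats.items():
    n_arenas, arena_bytes, _ = row
    total = n_arenas * arena_bytes
    print(f"{label}: {total:,} arena bytes, {utilization(*row):.1%} in use")
```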

However, with smaller arenas the peak memory use was _higher_.  It did
manage to release over 100 arenas (about 1% of the total ever
allocated).  With the larger arenas, none were ever released.

Either way, 32,921,097 objects remained in use at the end.  _If_ the
program had gone on to create another mass of new objects, the
big-arena version was better prepared for it:  it had 8,342,661 free
blocks of the right size class ready to reuse, but the small-arena
version had 2,951,103.

That is, releasing address space isn't a pure win either - it only
pays if it so happens that the space won't be needed soon again.  If
the space is so needed, releasing the space was a waste of time.

There is no "hard science" in poke-&-hope ;-)
...
...
.....
The dead obvious, dead simple, way to reduce mmap() expense is to call
it less often, which just requires changing a compile-time constant -
which will also call VirtualAlloc() equally less often on Windows.
...
That's assuming the dominating term in mmap() cost is O(1) rather than
O(size).  That's not a given.  The system call cost is certainly O(1),
but the cost of reserving and mapping HW pages, and zeroing them out is
most certainly O(# pages).
Good point!  I grossly overstated it.  But since I wasn't focused on
mmap to begin with, it doesn't change any of my motivations for
wanting to boost pool and arena sizes.
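
To make the O(1) versus O(# pages) distinction concrete, here is a toy
model, not a measurement.  The 4 KiB page size, the 1 GiB workload, and
both helper names are assumptions for illustration only.  It shows that
quadrupling the arena size cuts the number of arena requests (and so
system calls) by four, while the number of pages that must eventually be
mapped stays the same.

```python
import math

PAGE = 4096  # assumed hardware page size, for illustration only

def arena_requests(total_bytes, arena_bytes):
    """Arena allocations (one mmap/VirtualAlloc each) to cover total_bytes."""
    return math.ceil(total_bytes / arena_bytes)

def pages_mapped(total_bytes, arena_bytes, page=PAGE):
    """Pages covered by those arenas; this is the O(# pages) part."""
    return arena_requests(total_bytes, arena_bytes) * (arena_bytes // page)

total = 1 << 30  # hypothetical 1 GiB of arena space
for arena_bytes in (256 * 1024, 1024 * 1024):
    print(arena_bytes,
          arena_requests(total, arena_bytes),
          pages_mapped(total, arena_bytes))
```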

BTW, anyone keen to complicate the mmap management should first take
this recent change into account::

    https://bugs.python.org/issue37257

That appears to have killed off _the_ most overwhelmingly common cause
of obmalloc counter-productively releasing an arena only to create a
new one again milliseconds later.

My branch, and Neil's, both contain that change, which makes it much
harder to compare our branches' obmalloc arena stats with 3.7.  It
turns out that a whole lot of "released arenas" under 3.7 (and the same
will still be true in 3.8) were due to that worse-than-useless arena
thrashing.

Tim Peters