[Python-Dev] Re: radix tree arena map for obmalloc

16 Jun 2019

[Tim, to Neil]
>> Moving to bigger pools and bigger arenas are pretty much no-brainers
>> for us, [...]

[Antoine]
> Why "no-brainers"?

We're running tests, benchmarks, the Python programs we always run,
Python programs that are important to us, staring at obmalloc stats
... and seeing nothing bad, nothing mysterious, only neutral-to-good
results.  So what's to think about? ;-)  These are 64-bit boxes, with
terabytes of virtual address space.  Asking for a meg of that is just
reserving less than a thousandth of a thousandth of that range of
integers, not actual physical RAM.
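
To put rough numbers on that "thousandth of a thousandth": a minimal sketch, taking the "meg" above as a 1 MiB request. The 128 TiB (2**47 bytes) figure for user address space is an assumption, not something stated in this message, and the helper name is made up for illustration.

```python
MiB = 1 << 20
TiB = 1 << 40


def fraction_of_address_space(request_bytes, address_space_bytes):
    """Return the share of the address range that one reservation covers."""
    return request_bytes / address_space_bytes


# Against a single terabyte, 1 MiB is 2**-20 of the range, just under 1e-6.
one_tib_share = fraction_of_address_space(MiB, TiB)

# Assumed 128 TiB of user address space (2**47 bytes); the real limit
# depends on the OS and CPU.  Here 1 MiB is 2**-27 of the range.
assumed_share = fraction_of_address_space(MiB, 1 << 47)

print(one_tib_share, assumed_share)
```

Either way the reservation is a range of addresses only; physical RAM is committed when pages are actually touched.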

> Bigger pools sound ok,

They're not necessarily independent choices.  Increase pool size
without increasing arena size, and the number of pools per arena
falls.  At the extreme, if pools were made the same size as arenas,
we'd need at least 32 arenas just to start Python (which uses at least
one pool of _every_ possible size class before you hit the prompt -
note that, on 64-bit boxes, the number of possible size classes is
falling from 64 (3.7) to 32 (3.8), due to some desire to make
everything aligned to 16 bytes - which I've already seen account for
some programs needing 100s of MB of more RAM).
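
The arithmetic behind that paragraph can be sketched as follows. The 512-byte small-request threshold, 4 KiB pool and 256 KiB arena are the long-standing defaults as I recall them from obmalloc.c, not figures given in this message, so treat them as assumptions. The function names are invented for illustration, and the model ignores pool headers and arena alignment loss.

```python
import math

KiB = 1024
SMALL_REQUEST_THRESHOLD = 512  # assumed: largest request obmalloc serves itself


def pools_per_arena(arena_size, pool_size):
    """How many whole pools fit in one arena (alignment loss ignored)."""
    return arena_size // pool_size


def size_classes(alignment):
    """One size class per multiple of the alignment, up to the threshold."""
    return SMALL_REQUEST_THRESHOLD // alignment


def min_arenas_at_startup(n_size_classes, n_pools_per_arena):
    """Lower bound, given that startup uses at least one pool per size class."""
    return math.ceil(n_size_classes / n_pools_per_arena)


classes_37 = size_classes(8)    # 8-byte alignment
classes_38 = size_classes(16)   # 16-byte alignment on 64-bit boxes

# Assumed defaults: 4 KiB pools inside 256 KiB arenas.
default_pools = pools_per_arena(256 * KiB, 4 * KiB)

# The extreme case from the text: pools as large as arenas.
extreme_pools = pools_per_arena(256 * KiB, 256 * KiB)

print(classes_37, classes_38, default_pools,
      min_arenas_at_startup(classes_38, default_pools),
      min_arenas_at_startup(classes_38, extreme_pools))
```

Growing the pool size while holding the arena size fixed moves `pools_per_arena` toward 1, which is why the two constants are better chosen together.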

> but bigger arenas will make Python less likely to return memory to the system.

I know that gets repeated a lot, and I usually play along - but why do
people believe that?  Where's the evidence?

At the start, obmalloc never returned arenas to the system.  The vast
majority of users were fine with that.  A relative few weren't.  Evan
Jones wrote all the (considerable!) code to change that, and I
massaged it and checked it in - not because there was "scientific
proof" that it was more beneficial than harmful (it certainly added
new expenses!) overall, but because it seemed like a right thing to
do, _anticipating_ that the issue would become more important in
coming years.

I'm still glad it was done, but no tests were checked in to _quantify_
its presumed benefits - or even to verify that it ever returned arenas
to the system.  Best I can tell, nobody actually has any informed idea
how well it does.  Evan stared at programs that were important to him,
and fiddled things until he was "happy enough".

Not everyone was.  About five years ago, Kristján Valur Jónsson opened
this report:

    https://bugs.python.org/issue21220

suggesting a very different heuristic to try to free arenas.  The
"memcrunch.py" in his patch is the only time I've ever seen anyone
write code trying to measure whether obmalloc's arena-freeing is
effective.

I can verify that if you increase the number of objects in his script
by a factor of 100, my PR _never_ returns an arena to the system.  But
it's clear as mud to me whether his heuristic would return one either (with the
100x smaller number of objects in the original script, the PR branch
does recycle arenas).

So that's the only objective evidence I have :-)

I've looked at obmalloc stats in other programs at various stages, and
saw nothing concerning.  memcrunch.py appears to model object lifetimes
as coming from a uniform distribution, but in real life they appear to
be strongly multi-modal (with high peaks at the "way less than an eye
blink" and "effectively immortal" ends).

> We should evaluate what problem we are trying to solve here,

I'm not trying to solve a problem.  This is a "right thing to do"
thing, anticipating that slinging a massive number of objects on
massive machines will become ever more important, and that changing
20-year-old constants will allow obmalloc to spend more time in its
fastest paths instead of its slowest.  We haven't been especially
pro-active about giant machines, and are suffering from it:

https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters...
"""
Update: Here is a “fun” Python performance bug that I ran into the
other day: deleting a set of 1 billion strings takes >12 hours.
Obviously, this particular instance can be fixed, but this exactly the
sort of thing that I would never have done a few years ago. A billion
strings seemed like a lot back then, but now we regularly discuss
multiple Terabytes of input data as “not a big deal”. This may not
apply for your settings, but it does for mine.
"""

That was _probably_ due to obmalloc's move-one-at-a-time way of
keeping its usable arenas list sorted, which sat un-diagnosed for over
a year.

    https://bugs.python.org/issue32846

Fixing the underlying cause put giant machines on my radar, and
getting rid of obmalloc's pool size limit was the next obvious thing
that would help them (although not in the same universe as cutting
quadratic time to linear).
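
The quadratic-versus-linear difference can be illustrated with a toy model. This is not obmalloc's code: it is a sketch assuming a list of "arenas" kept sorted by free-pool count, where one arena's count is bumped by 1 and the order must be restored. All names here are hypothetical.

```python
def bump_one_at_a_time(arenas, i):
    """Bump arenas[i]'s count, then move it right one slot at a time.

    Each arena is a [name, free_count] list.  Returns the swap count.
    """
    arenas[i][1] += 1
    swaps = 0
    while i + 1 < len(arenas) and arenas[i][1] > arenas[i + 1][1]:
        arenas[i], arenas[i + 1] = arenas[i + 1], arenas[i]
        i += 1
        swaps += 1
    return swaps


def bump_with_last_table(arenas, last, i):
    """Same update, jumping straight to the last arena with the old count.

    `last` maps a free count to the index of the rightmost arena having it.
    Returns the swap count, which is always 0 or 1.
    """
    c = arenas[i][1]
    j = last[c]
    swaps = 0
    if j != i:
        arenas[i], arenas[j] = arenas[j], arenas[i]
        swaps = 1
    arenas[j][1] = c + 1
    # The run of arenas with count c now ends one slot earlier, if any remain.
    if j > 0 and arenas[j - 1][1] == c:
        last[c] = j - 1
    else:
        del last[c]
    # Slot j starts the run with count c + 1 unless that run already exists.
    last.setdefault(c + 1, j)
    return swaps


def compare(n):
    """Bump the first arena n times under each scheme; return total swaps."""
    slow = [[k, 0] for k in range(n)]
    fast = [[k, 0] for k in range(n)]
    last = {0: n - 1}
    slow_swaps = sum(bump_one_at_a_time(slow, 0) for _ in range(n))
    fast_swaps = sum(bump_with_last_table(fast, last, 0) for _ in range(n))
    # Both schemes must leave the same sorted sequence of counts.
    assert [a[1] for a in slow] == [a[1] for a in fast]
    return slow_swaps, fast_swaps


print(compare(1000))
```

With n arenas that all start at the same count, the one-at-a-time version does n*(n-1)/2 swaps in total, while the table version does at most one swap per bump.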

> instead of staring at micro-benchmark numbers on an idle system.

My only interest in those is that they're not slowing down, because
that's important too.  The aim here is much more to make life better
for programs slinging millions - even billions - of objects.
obmalloc's internal overheads are frugal, but not free.  For example,
it has to allocate at least 56 bytes of separate bookkeeping info for
each arena.  Nobody cares when they have 100 arenas, but when there
are a million arenas (which I've seen), that adds up.
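
A quick check of how that adds up, using the 56-byte lower bound quoted above (the function name is made up for illustration):

```python
BOOKKEEPING_BYTES_PER_ARENA = 56  # the "at least 56 bytes" from the text


def bookkeeping_bytes(n_arenas):
    """Lower bound on separate per-arena bookkeeping, in bytes."""
    return n_arenas * BOOKKEEPING_BYTES_PER_ARENA


for n in (100, 1_000_000):
    total = bookkeeping_bytes(n)
    print(f"{n:>9,} arenas: {total:>11,} bytes ({total / 2**20:.2f} MiB)")
```

Since the cost is per arena, larger arenas need proportionally fewer of these records for the same amount of memory.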

> Micro-benchmarks don't tell you what happens on a loaded system with
> many processes, lots of I/O happening.

While running a loaded system with many processes and lots of I/O
doesn't tell you what happens in micro-benchmarks ;-)  A difference is
that performance of micro-benchmarks can be more-than-less reliably
measured, while nobody can do better than guess about how changes
affect a loaded system.  Changing address space reservations from less
than trivial to still less than trivial just isn't a _plausible_
source of disaster, at least not to my eyes.

> If the problem is the cost of mmap() and munmap() calls, then the
> solution more or less exists at the system level: jemalloc and other
> allocators use madvise() with MADV_FREE (see here:
> https://lwn.net/Articles/593564/).
>
> A possible design is a free list of arenas on which you use MADV_FREE
> to let the system know the memory *can* be reclaimed.  When the free
> list overflows, call munmap() on extraneous arenas.

People can certainly pursue that if they like.  I'm not interested in
adding more complication that helps only one of obmalloc's slowest
paths on only one platform.  Increasing pool and arena sizes targets
all the slower paths on all platforms.

The dead obvious, dead simple, way to reduce mmap() expense is to call
it less often, which just requires changing a compile-time constant -
which will also call VirtualAlloc() equally less often on Windows.
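
As a rough illustration of "call it less often": a sketch counting arena-sized requests to the OS for a given amount of memory. The 256 KiB figure is the long-standing default arena size as I recall it, and 1 MiB is the "meg" mentioned earlier; both are assumptions here, and the helper name is hypothetical.

```python
KiB = 1 << 10
MiB = 1 << 20
GiB = 1 << 30


def arena_requests(total_bytes, arena_size):
    """Arena-sized requests (mmap or VirtualAlloc) needed to cover total_bytes."""
    return (total_bytes + arena_size - 1) // arena_size


small_arenas = arena_requests(1 * GiB, 256 * KiB)  # assumed current default
large_arenas = arena_requests(1 * GiB, 1 * MiB)    # assumed larger arena

print(small_arenas, large_arenas, small_arenas // large_arenas)
```

The ratio depends only on the two arena sizes, so the saving applies equally on every platform that allocates arenas one system call at a time.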

The "real" problem you seem to care about is returning arenas to the
system.  I think more work should be done on that, but it's not the
goal of _this_ PR.  I'm not ignoring it, but so far I've seen no
reason to be worried (not in my PR - but how the much larger arenas
Neil generally uses fare in this respect is unknown to me).

Tim Peters