[Inada Naoki <songofacandy@gmail.com>, to Neil S]
Oh, do you mean your branch doesn't have headers in each page?
That's probably right ;-) Neil is using a new data structure, a radix tree implementing a sparse set of arena addresses, so within obmalloc pools, which can be any multiple of 4 KiB in size (on a 64-bit box), every byte beyond the pool header is usable for user data. In my patch there is no new data structure, but it needs to store an "arena index" at the start of every page (every 4 KiB) within a pool.

I certainly _like_ Neil's code better. It's clean rather than excruciatingly tricky. The question is whether that's enough to justify the memory burden of an additional data structure, which can potentially grow very large. So I've been working with Neil to see whether it's possible to make it faster than my branch, to give it another selling point people actually care about ;-)

Should also note that Neil's approach never needs to read uninitialized memory, so we could throw away decades of (e.g.) valgrind pain caused by the current approach (which my patch builds on).
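To make the contrast concrete, here's a toy Python sketch of the radix-tree idea. It looks nothing like Neil's actual C code: the class name, the bit widths, and the assumption that arenas are aligned to their own size are all made up here, purely for illustration.

```python
# Toy sketch of "a radix tree implementing a sparse set of arena addresses".
# Illustrative only: names, bit widths, and the arena-alignment assumption
# below are not taken from Neil's branch.

ARENA_BITS = 24            # pretend arenas are 16 MiB and 16 MiB-aligned
LEVEL_BITS = (13, 13, 14)  # split the remaining high address bits into 3 levels

class ArenaSet:
    """Answers "does this address lie inside an arena we allocated?"."""

    def __init__(self):
        self.root = {}

    def _path(self, addr):
        key = addr >> ARENA_BITS           # discard the within-arena offset
        parts = []
        for bits in reversed(LEVEL_BITS):  # peel off the low chunk first
            parts.append(key & ((1 << bits) - 1))
            key >>= bits
        return tuple(reversed(parts))      # (top, middle, bottom) indices

    def add_arena(self, arena_base):
        top, mid, bot = self._path(arena_base)
        self.root.setdefault(top, {}).setdefault(mid, set()).add(bot)

    def remove_arena(self, arena_base):
        top, mid, bot = self._path(arena_base)
        self.root[top][mid].discard(bot)

    def __contains__(self, addr):
        top, mid, bot = self._path(addr)
        return bot in self.root.get(top, {}).get(mid, set())
```

The point is that answering "does this pointer belong to obmalloc?" only ever consults addresses we recorded ourselves; it never reads the memory the pointer aims at, which is exactly where the valgrind complaints about the current header-reading address_in_range() trick come from.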
https://bugs.python.org/issue32846
As far as I remember, this bug was caused by cache thrashing (the page header is aligned to 4K, so cache lines can conflict often), or by the O(N) free() which has already been fixed.
I doubt that report is relevant here, but anyone is free to try it with Neil's branch: https://github.com/nascheme/cpython/tree/obmalloc_radix_tree Note that, last I looked at that branch, Neil was still using 4 KiB obmalloc pools, all page-aligned, but with much larger arenas (16 MiB: 16 times bigger than my branch uses, and 64 times bigger than Python currently uses).

The `O(N) free()` fix may very well be relevant, though. To my eyes, while there was plenty of speculation in that bug report, nobody actually dug in deep enough to nail a specific cause.

A quick try just now on my branch (which includes the `O(N) free()` fix), running Terry Reedy's simple code from that report, shows much improved behavior, until I run out of RAM. For example, roughly 4.3 seconds to delete 40 million strings in a set, and 9.1 seconds to delete 80 million. Not really linear, but very far from quadratic. In contrast, Terry saw nearly a quadrupling of delete time when moving from 32M to 64M strings. So more than one thing was going on there, but it looks likely that the major pain was caused by quadratic-time arena list sorting.
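For anyone who wants to poke at this themselves, here's the flavor of measurement I mean. It's not Terry's exact script from the bug report, just the same idea, and it needs a lot of RAM.

```python
# Same flavor as the bpo-32846 test (not Terry Reedy's exact script):
# build a big set of distinct strings, then time tearing it down.
from time import perf_counter

def time_set_teardown(n):
    s = {str(i) for i in range(n)}  # n distinct digit strings
    start = perf_counter()
    del s                           # last reference: frees the set and all its strings
    return perf_counter() - start

for n in (40_000_000, 80_000_000):
    print(f"{n:>11,} strings: {time_set_teardown(n):.1f} seconds to delete")
```

If teardown were linear, doubling n would roughly double the time; quadratic behavior would roughly quadruple it, which is about what Terry reported going from 32M to 64M strings.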