[Inada Naoki <songofacandy@gmail.com>, to Neil S]
Oh, do you mean your branch doesn't have headers in each page?
That's probably right ;-) Neil is using a new data structure, a radix tree implementing a sparse set of arena addresses, so within obmalloc pools, which can be any multiple of 4 KiB in size (on a 64-bit box), every byte beyond the pool header is usable for user data. In my patch there is no new data structure, but it needs to store an "arena index" at the start of every page (every 4 KiB) within a pool.

I certainly _like_ Neil's code better. It's clean rather than excruciatingly tricky. The question is whether that's enough to justify the memory burden of an additional data structure, which can potentially grow very large. So I've been working with Neil to see whether it's possible to make it faster than my branch, to give it another selling point people actually care about ;-)

Should also note that Neil's approach never needs to read uninitialized memory, so we could throw away decades of (e.g.) valgrind pain caused by the current approach (which my patch builds on).
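To make the contrast concrete, here's a toy Python sketch of the radix-tree idea. It looks nothing like Neil's actual C code: the class name, the bit widths, and the assumption that arenas are aligned to their own size are all made up here, purely for illustration.

```python
# Toy sketch of "a radix tree implementing a sparse set of arena addresses".
# Illustrative only: names, bit widths, and the arena-alignment assumption
# below are not taken from Neil's branch.

ARENA_BITS = 24            # pretend arenas are 16 MiB and 16 MiB-aligned
LEVEL_BITS = (13, 13, 14)  # split the remaining high address bits into 3 levels

class ArenaSet:
    """Answers "does this address lie inside an arena we allocated?"."""

    def __init__(self):
        self.root = {}

    def _path(self, addr):
        key = addr >> ARENA_BITS           # discard the within-arena offset
        parts = []
        for bits in reversed(LEVEL_BITS):  # peel off the low chunk first
            parts.append(key & ((1 << bits) - 1))
            key >>= bits
        return tuple(reversed(parts))      # (top, middle, bottom) indices

    def add_arena(self, arena_base):
        top, mid, bot = self._path(arena_base)
        self.root.setdefault(top, {}).setdefault(mid, set()).add(bot)

    def remove_arena(self, arena_base):
        top, mid, bot = self._path(arena_base)
        self.root[top][mid].discard(bot)

    def __contains__(self, addr):
        top, mid, bot = self._path(addr)
        return bot in self.root.get(top, {}).get(mid, set())
```

The point is that answering "does this pointer belong to obmalloc?" only ever consults addresses we recorded ourselves; it never reads the memory the pointer aims at, which is exactly where the valgrind complaints about the current header-reading address_in_range() trick come from.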
https://bugs.python.org/issue32846
As far as I remember, this bug was caused by cache thrashing (the page header is aligned to 4K, so cache lines can conflict often), or by the O(N) free() which has already been fixed.
I doubt that report is relevant here, but anyone is free to try it with Neil's branch: https://github.com/nascheme/cpython/tree/obmalloc_radix_tree Note that, last I looked at that branch, Neil was still using 4 KiB obmalloc pools, all page-aligned, but with much larger arenas (16 MiB: 16 times bigger than my branch uses, and 64 times bigger than Python currently uses).

The `O(N) free()` fix may very well be relevant, though. To my eyes, while there was plenty of speculation in that bug report, nobody actually dug in deep enough to nail a specific cause.

A quick try just now on my branch (which includes the `O(N) free()` fix), running Terry Reedy's simple code from that report, shows much improved behavior, until I run out of RAM. For example, roughly 4.3 seconds to delete 40 million strings in a set, and 9.1 seconds to delete 80 million. Not really linear, but very far from quadratic. In contrast, Terry saw nearly a quadrupling of delete time when moving from 32M to 64M strings. So more than one thing was going on there, but it looks likely that the major pain was caused by quadratic-time arena list sorting.
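For anyone who wants to poke at this themselves, here's the flavor of measurement I mean. It's not Terry's exact script from the bug report, just the same idea, and it needs a lot of RAM.

```python
# Same flavor as the bpo-32846 test (not Terry Reedy's exact script):
# build a big set of distinct strings, then time tearing it down.
from time import perf_counter

def time_set_teardown(n):
    s = {str(i) for i in range(n)}  # n distinct digit strings
    start = perf_counter()
    del s                           # last reference: frees the set and all its strings
    return perf_counter() - start

for n in (40_000_000, 80_000_000):
    print(f"{n:>11,} strings: {time_set_teardown(n):.1f} seconds to delete")
```

If teardown were linear, doubling n would roughly double the time; quadratic behavior would roughly quadruple it, which is about what Terry reported going from 32M to 64M strings.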