radix tree arena map for obmalloc
I've been working on this idea for a couple of days. Tim Peters has been helping me out and I think it has come far enough to get some more feedback. It is not yet a good replacement for the current address_in_range() test. However, performance-wise, it is very close. Tim figures we are not done optimizing it yet, so maybe it will get better. Code is available on my GitHub branch:

    https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

Tim's "obmalloc-big-pools" branch is what I have been comparing it to. It seems 8 KB pools are faster than 4 KB. I applied Tim's arena thrashing fix (bpo-37257) to both branches. Some rough (--fast) pyperformance benchmark results are below.

| Benchmark | obmalloc-big-pools | obmalloc_radix |
|-------------------------|--------------------|------------------------------|
| crypto_pyaes | 168 ms | 170 ms: 1.01x slower (+1%) |
| hexiom | 13.7 ms | 13.6 ms: 1.01x faster (-1%) |
| json_dumps | 15.9 ms | 15.6 ms: 1.02x faster (-2%) |
| json_loads | 36.9 us | 37.1 us: 1.01x slower (+1%) |
| meteor_contest | 141 ms | 139 ms: 1.02x faster (-2%) |
| nqueens | 137 ms | 140 ms: 1.02x slower (+2%) |
| pickle_dict | 26.2 us | 25.9 us: 1.01x faster (-1%) |
| pickle_list | 3.91 us | 3.94 us: 1.01x slower (+1%) |
| python_startup_no_site | 8.00 ms | 7.78 ms: 1.03x faster (-3%) |
| regex_dna | 246 ms | 241 ms: 1.02x faster (-2%) |
| regex_v8 | 29.6 ms | 30.0 ms: 1.01x slower (+1%) |
| richards | 93.9 ms | 92.7 ms: 1.01x faster (-1%) |
| scimark_fft | 525 ms | 531 ms: 1.01x slower (+1%) |
| scimark_sparse_mat_mult | 6.32 ms | 6.24 ms: 1.01x faster (-1%) |
| spectral_norm | 195 ms | 198 ms: 1.02x slower (+2%) |
| sqlalchemy_imperative | 49.5 ms | 50.5 ms: 1.02x slower (+2%) |
| sympy_expand | 691 ms | 695 ms: 1.01x slower (+1%) |
| unpickle_list | 5.09 us | 5.32 us: 1.04x slower (+4%) |
| xml_etree_parse | 213 ms | 215 ms: 1.01x slower (+1%) |
| xml_etree_generate | 134 ms | 136 ms: 1.01x slower (+1%) |
| xml_etree_process | 103 ms | 104 ms: 1.01x slower (+1%) |

Not significant (34): 2to3; chameleon; chaos; deltablue; django_template; dulwich_log; fannkuch; float; go; html5lib; logging_format; logging_silent; logging_simple; mako; nbody; pathlib; pickle; pidigits; python_startup; raytrace; regex_compile; regex_effbot; scimark_lu; scimark_monte_carlo; scimark_sor; sqlalchemy_declarative; sqlite_synth; sympy_integrate; sympy_sum; sympy_str; telco; unpack_sequence; unpickle; xml_etree_iterparse
[Inada Naoki <songofacandy@gmail.com>]

Oh, do you mean your branch doesn't have headers in each page?

https://bugs.python.org/issue32846

As far as I remember, that bug was caused by cache thrashing (the page header is aligned to 4 KiB, so cache lines can conflict often). Or it may have been caused by the O(N) free(), which is fixed already. I'll look into it next week.
[Inada Naoki <songofacandy@gmail.com>, to Neil S.]
Oh, do you mean your branch doesn't have headers in each page?
That's probably right ;-) Neil is using a new data structure, a radix tree implementing a sparse set of arena addresses. Within obmalloc pools, which can be of any multiple-of-4KiB (on a 64-bit box) size, every byte beyond the pool header is usable for user data. In my patch, there is no new data structure, but it needs to store an "arena index" at the start of every page (every 4K bytes) within a pool.

I certainly _like_ Neil's code better. It's clean rather than excruciatingly tricky. The question is whether that's enough to justify the memory burden of an additional data structure (which can potentially grow very large). So I've been working with Neil to see whether it's possible to make it faster than my branch, to give it another selling point people actually care about ;-)

Should also note that Neil's approach never needs to read uninitialized memory, so we could throw away decades of (e.g.) valgrind pain caused by the current approach (which my patch builds on).
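To make the contrast concrete, here is a hypothetical C sketch of the per-page scheme described above; it is not the actual PR code, and the names, the fixed 4 KiB page constant, and the 1 MiB arena size are illustrative only. The first bytes of every OS page inside a pool hold the index of the owning arena, which is validated against the global arena table:

```
/* Hypothetical sketch of the per-page scheme, not the actual PR code.
 * Every 4 KiB page inside an obmalloc pool starts with the index of the
 * arena it belongs to, so no extra global structure is needed.  Note the
 * read of `idx` may touch memory obmalloc never wrote, if `p` came from
 * the system malloc -- that is the valgrind pain mentioned above. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_MASK  ((uintptr_t)4096 - 1)     /* assumes 4 KiB OS pages */
#define ARENA_SIZE (1024 * 1024)             /* assumes 1 MiB arenas   */

struct arena_object {                        /* simplified descriptor  */
    uintptr_t address;                       /* 0 if the slot is unused */
};

static struct arena_object *arenas;          /* global arena vector    */
static unsigned int maxarenas;

static int
address_in_range(const void *p)
{
    /* Find the start of the page holding p and read the arena index
       stored there when the pool was carved up. */
    uintptr_t page = (uintptr_t)p & ~PAGE_MASK;
    unsigned int idx = *(const unsigned int *)page;

    return idx < maxarenas
           && (uintptr_t)p - arenas[idx].address < ARENA_SIZE
           && arenas[idx].address != 0;
}
```

The radix-tree approach answers the same question without storing anything inside the pools or pages; a sketch of that lookup appears further down, after Tim describes it.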
https://bugs.python.org/issue32846
As far as I remember, that bug was caused by cache thrashing (the page header is aligned to 4 KiB, so cache lines can conflict often). Or it may have been caused by the O(N) free(), which is fixed already.
I doubt that report is relevant here, but anyone is free to try it with Neil's branch:

    https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

However, last I looked there Neil was still using 4 KiB obmalloc pools, all page-aligned. But using much larger arenas (16 MiB, 16 times bigger than my branch, and 64 times bigger than Python currently uses).

But the `O(N) free()` fix may very well be relevant. To my eyes, while there was plenty of speculation in that bug report, nobody actually dug in deep to nail a specific cause. A quick try just now on my branch (which includes the `O(N) free()` fix) on Terry Reedy's simple code in that report shows much improved behavior, until I run out of RAM. For example, roughly 4.3 seconds to delete 40 million strings in a set, and 9.1 to delete 80 million in a set. Not really linear, but very far from quadratic. In contrast, Terry saw nearly a quadrupling of delete time when moving from 32M to 64M strings. So more than one thing was going on there, but it looks likely that the major pain was caused by quadratic-time arena list sorting.
On 2019-06-14, Tim Peters wrote:
However, last I looked there Neil was still using 4 KiB obmalloc pools, all page-aligned. But using much larger arenas (16 MiB, 16 times bigger than my branch, and 64 times bigger than Python currently uses).
I was testing it versus your obmalloc-big-pool branch and trying to make it a fair comparison. You are correct: 4 KiB pools and 16 MiB arenas. Maybe I should test with 16 KiB pools and 16 MiB arenas. That seems a better-tuned setting for current machines and workloads.
Here are benchmark results for 64 MiB arenas and 16 KiB pools. I ran without the --fast option and on a Linux machine in single-user mode. The "base" column is the obmalloc-big-pools branch with ARENA_SIZE = 64 MiB and POOL_SIZE = 16 KiB. The "radix" column is obmalloc_radix_tree (commit 5e00f6041) with the same arena and pool sizes.

| Benchmark | base (16 KiB/64 MiB) | radix (16 KiB/64 MiB) |
|-------------------------|----------------------|------------------------------|
| 2to3 | 290 ms | 292 ms: 1.00x slower (+0%) |
| crypto_pyaes | 114 ms | 116 ms: 1.02x slower (+2%) |
| django_template | 109 ms | 106 ms: 1.03x faster (-3%) |
| dulwich_log | 75.2 ms | 74.5 ms: 1.01x faster (-1%) |
| fannkuch | 454 ms | 449 ms: 1.01x faster (-1%) |
| float | 113 ms | 111 ms: 1.01x faster (-1%) |
| hexiom | 9.45 ms | 9.47 ms: 1.00x slower (+0%) |
| json_dumps | 10.6 ms | 11.1 ms: 1.04x slower (+4%) |
| json_loads | 24.4 us | 25.2 us: 1.03x slower (+3%) |
| logging_simple | 8.19 us | 8.37 us: 1.02x slower (+2%) |
| mako | 15.1 ms | 15.1 ms: 1.01x slower (+1%) |
| meteor_contest | 98.3 ms | 97.1 ms: 1.01x faster (-1%) |
| nbody | 142 ms | 140 ms: 1.02x faster (-2%) |
| nqueens | 93.8 ms | 93.0 ms: 1.01x faster (-1%) |
| pickle | 8.89 us | 8.85 us: 1.01x faster (-0%) |
| pickle_dict | 17.9 us | 18.2 us: 1.01x slower (+1%) |
| pickle_list | 2.68 us | 2.64 us: 1.01x faster (-1%) |
| pidigits | 182 ms | 184 ms: 1.01x slower (+1%) |
| python_startup_no_site | 5.31 ms | 5.33 ms: 1.00x slower (+0%) |
| raytrace | 483 ms | 476 ms: 1.02x faster (-1%) |
| regex_compile | 167 ms | 169 ms: 1.01x slower (+1%) |
| regex_dna | 170 ms | 171 ms: 1.01x slower (+1%) |
| regex_effbot | 2.70 ms | 2.75 ms: 1.02x slower (+2%) |
| regex_v8 | 21.1 ms | 21.3 ms: 1.01x slower (+1%) |
| scimark_fft | 368 ms | 371 ms: 1.01x slower (+1%) |
| scimark_monte_carlo | 103 ms | 101 ms: 1.02x faster (-2%) |
| scimark_sparse_mat_mult | 4.31 ms | 4.27 ms: 1.01x faster (-1%) |
| spectral_norm | 131 ms | 135 ms: 1.03x slower (+3%) |
| sqlite_synth | 2.89 us | 2.99 us: 1.03x slower (+3%) |
| sympy_expand | 491 ms | 502 ms: 1.02x slower (+2%) |
| sympy_integrate | 20.9 ms | 20.8 ms: 1.00x faster (-0%) |
| sympy_sum | 99.0 ms | 98.2 ms: 1.01x faster (-1%) |
| telco | 6.64 ms | 6.86 ms: 1.03x slower (+3%) |
| unpickle | 12.3 us | 13.3 us: 1.09x slower (+9%) |
| unpickle_list | 3.56 us | 3.68 us: 1.03x slower (+3%) |
| xml_etree_iterparse | 101 ms | 101 ms: 1.01x faster (-1%) |
| xml_etree_process | 70.8 ms | 70.7 ms: 1.00x faster (-0%) |

Not significant (18): chaos; deltablue; go; html5lib; logging_format; logging_silent; pathlib; python_startup; richards; scimark_lu; scimark_sor; sqlalchemy_declarative; sqlalchemy_imperative; sympy_str; tornado_http; unpack_sequence; xml_etree_parse; xml_etree_generate

The second table compares the radix branch at its original settings (4 KiB pools, 16 MiB arenas) against the same branch with 16 KiB pools and 64 MiB arenas.

| Benchmark | radix (4 KiB/16 MiB) | radix (16 KiB/64 MiB) |
|------------------------|----------------------|------------------------------|
| chaos | 108 ms | 107 ms: 1.01x faster (-1%) |
| crypto_pyaes | 116 ms | 116 ms: 1.01x slower (+1%) |
| django_template | 111 ms | 106 ms: 1.05x faster (-4%) |
| dulwich_log | 74.9 ms | 74.5 ms: 1.00x faster (-0%) |
| float | 112 ms | 111 ms: 1.01x faster (-1%) |
| hexiom | 9.44 ms | 9.47 ms: 1.00x slower (+0%) |
| json_dumps | 10.6 ms | 11.1 ms: 1.04x slower (+4%) |
| json_loads | 24.5 us | 25.2 us: 1.03x slower (+3%) |
| logging_silent | 182 ns | 179 ns: 1.02x faster (-2%) |
| logging_simple | 8.28 us | 8.37 us: 1.01x slower (+1%) |
| meteor_contest | 98.2 ms | 97.1 ms: 1.01x faster (-1%) |
| nbody | 141 ms | 140 ms: 1.00x faster (-0%) |
| nqueens | 93.7 ms | 93.0 ms: 1.01x faster (-1%) |
| pathlib | 18.5 ms | 18.9 ms: 1.02x slower (+2%) |
| pickle | 8.95 us | 8.85 us: 1.01x faster (-1%) |
| pickle_dict | 18.0 us | 18.2 us: 1.01x slower (+1%) |
| pickle_list | 2.69 us | 2.64 us: 1.02x faster (-2%) |
| pidigits | 182 ms | 184 ms: 1.01x slower (+1%) |
| python_startup | 7.74 ms | 7.70 ms: 1.01x faster (-1%) |
| python_startup_no_site | 5.34 ms | 5.33 ms: 1.00x faster (-0%) |
| regex_dna | 170 ms | 171 ms: 1.00x slower (+0%) |
| regex_effbot | 2.72 ms | 2.75 ms: 1.01x slower (+1%) |
| scimark_fft | 359 ms | 371 ms: 1.03x slower (+3%) |
| scimark_monte_carlo | 100 ms | 101 ms: 1.01x slower (+1%) |
| spectral_norm | 133 ms | 135 ms: 1.02x slower (+2%) |
| sqlite_synth | 2.88 us | 2.99 us: 1.04x slower (+4%) |
| sympy_integrate | 21.0 ms | 20.8 ms: 1.01x faster (-1%) |
| sympy_sum | 101 ms | 98.2 ms: 1.03x faster (-3%) |
| sympy_str | 209 ms | 206 ms: 1.01x faster (-1%) |
| telco | 6.61 ms | 6.86 ms: 1.04x slower (+4%) |
| tornado_http | 170 ms | 169 ms: 1.01x faster (-1%) |
| unpickle | 12.7 us | 13.3 us: 1.05x slower (+5%) |
| xml_etree_iterparse | 102 ms | 101 ms: 1.01x faster (-1%) |
| xml_etree_generate | 93.0 ms | 92.3 ms: 1.01x faster (-1%) |
| xml_etree_process | 71.4 ms | 70.7 ms: 1.01x faster (-1%) |

Not significant (20): 2to3; deltablue; fannkuch; go; html5lib; logging_format; mako; raytrace; regex_compile; regex_v8; richards; scimark_lu; scimark_sor; scimark_sparse_mat_mult; sqlalchemy_declarative; sqlalchemy_imperative; sympy_expand; unpack_sequence; unpickle_list; xml_etree_parse
On 2019-06-15, Inada Naoki wrote:
Oh, do you mean your branch doesn't have headers in each page?
That's right. Each pool still has a header but pools can be larger than the page size. Tim's obmalloc-big-pool idea writes something to the head of each page within a pool. The radix tree doesn't need that and actually doesn't care about OS page size.

BTW, the current radix tree doesn't even require that pools are aligned to POOL_SIZE. We probably want to keep pools aligned because other parts of obmalloc rely on that.

Here is the matchup of the radix tree vs the current address_in_range() approach:

- nearly the same in terms of performance. It might depend on OS and workload but based on my testing on Linux, they are very close. Would be good to do more testing but I think the radix tree is not going to be faster, only slower.

- radix tree uses a bit more memory overhead. Maybe 1 or 2 MiB on a 64-bit OS. The radix tree uses more as memory use goes up but it is a small fraction of total used memory. The extra memory use is the main downside at this point, I think.

- the radix tree doesn't read uninitialized memory. The current address_in_range() approach has worked very well but is relying on some assumptions about the OS (how it maps pages into the program address space). This is the only aspect where the radix tree is clearly better. I'm not sure this matters enough to offset the extra memory use.

- IMHO, the radix tree code is a bit simpler than Tim's obmalloc-big-pool code. That's not a big deal though as long as the code works and is well commented (which Tim's code is).

My feeling right now is that Tim's obmalloc-big-pool is the best design at this point. Using 8 KB or 16 KB pools seems to be better than 4 KB. The extra complexity added by Tim's change is not so nice. obmalloc is already extremely subtle and obmalloc-big-pool makes it more so.

Regards,

Neil
[Neil Schemenauer <nas-python@arctrix.com>]
... BTW, the current radix tree doesn't even require that pools are aligned to POOL_SIZE. We probably want to keep pools aligned because other parts of obmalloc rely on that.
obmalloc relies on it heavily. Another radix tree could map block addresses to all the necessary info in a current pool header, but that could become gigantic. For example, even a current "teensy" 256 KiB arena can be carved into the order of 16K 16-byte blocks (the smallest size class). And finding a pool header now is unbeatably cheap: just clear the last 12 address bits, and you're done.
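For reference, the "clear the last 12 address bits" trick amounts to something like the following; this is a minimal sketch assuming 4 KiB, 4 KiB-aligned pools, and obmalloc's real macro is spelled in terms of its own POOL_SIZE constant:

```
#include <stdint.h>

#define POOL_SIZE 4096                     /* 2**12, assumed here */

/* Round an arbitrary block address down to its pool header. */
#define POOL_ADDR(p) ((void *)((uintptr_t)(p) & ~(uintptr_t)(POOL_SIZE - 1)))
```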
Here is the matchup of the radix tree vs the current address_in_range() approach.
- nearly the same in terms of performance. It might depend on OS and workload but based on my testing on Linux, they are very close. Would be good to do more testing but I think the radix tree is not going to be faster, only slower.
People should understand that the only point to these things is to determine whether a pointer passed to a free() or realloc() spelling was obtained from an obmalloc pool or from the system malloc family. So they're invoked once near the very starts of those two functions, and that's all. Both ways are much more expensive than finding a pool header (which is just clearing some trailing address bits).

The current way reads an "arena index" out of the pool header, uses that to index the file-static `arenas` vector to get a pointer to an arena descriptor, then reads the arena base address out of the descriptor. That's used to determine whether the original address is contained in the arena. The primary change my PR makes is to read the arena index from the start of the _page_ the address belongs to instead (the pool header address is irrelevant to this, apart from the fact that a pool header is aligned to the first page in a pool).

The radix tree picks bits out of the address three times to index into a 3-level (but potentially broad) tree, ending with a node containing info about the only two possible arenas the original address may belong to. Then that info is used to check. The number of loads is essentially the same, but the multiple levels of indexing in the tree are a teensy bit more expensive because they require more bit-fiddling.

I spent hours, in all, dreaming up a way to make the _seemingly_ more complex final "so is the address in one of those two arenas or not?" check about as cheap as the current way. But Neil didn't see any significant timing difference after making that change, which was mildly disappointing but not really surprising: arithmetic is just plain cheap compared to reading up memory.
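A hedged sketch of what such a 3-level lookup can look like: the bit split, node layout, and names below are invented for the example (Neil's branch picks its own constants and stores somewhat different per-leaf info). The idea is simply that three loads walk the tree and a final comparison or two answers "is this address inside a known arena?":

```
/* Illustrative only: a 3-level radix tree keyed on the high bits of a
 * 48-bit user-space address, with 1 MiB granularity at the leaves.
 * Because arenas need not be aligned to their own size, an aligned
 * 1 MiB chunk can be covered by at most two arenas, so each leaf keeps
 * two candidate arena base addresses. */
#include <stdint.h>
#include <stdbool.h>

#define ARENA_BITS 20                        /* 1 MiB arenas (assumed)  */
#define ARENA_SIZE ((uintptr_t)1 << ARENA_BITS)
#define B1 10                                /* bits consumed per level */
#define B2 10
#define B3 (48 - ARENA_BITS - B1 - B2)       /* = 8                     */

typedef struct {
    uintptr_t arena1, arena2;   /* arenas overlapping this chunk, 0 = none */
} leaf_t;

typedef struct { leaf_t   entries[1 << B3]; } node3_t;
typedef struct { node3_t *child[1 << B2];   } node2_t;
typedef struct { node2_t *child[1 << B1];   } node1_t;

static node1_t root;            /* interior nodes allocated on demand */

static bool
arena_map_contains(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    node2_t *n2 = root.child[(a >> (ARENA_BITS + B2 + B3)) & ((1 << B1) - 1)];
    if (n2 == NULL)
        return false;
    node3_t *n3 = n2->child[(a >> (ARENA_BITS + B3)) & ((1 << B2) - 1)];
    if (n3 == NULL)
        return false;
    leaf_t leaf = n3->entries[(a >> ARENA_BITS) & ((1 << B3) - 1)];
    return (leaf.arena1 != 0 && a - leaf.arena1 < ARENA_SIZE) ||
           (leaf.arena2 != 0 && a - leaf.arena2 < ARENA_SIZE);
}
```

Unlike the page-index sketch earlier, nothing is ever read from inside the pools or pages themselves, which is why this scheme never touches uninitialized memory.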
- radix tree uses a bit more memory overhead. Maybe 1 or 2 MiB on a 64-bit OS. The radix tree uses more as memory use goes up but it is a small fraction of total used memory. The extra memory use is the main downside at this point, I think.
I'd call it the only downside. But nobody yet has quantified how bad it can get.
- the radix tree doesn't read uninitialized memory. The current address_in_range() approach has worked very well but is relying on some assumptions about the OS (how it maps pages into the program address space). This is the only aspect where the radix tree is clearly better. I'm not sure this matters enough to offset the extra memory use.
I'm not worried about that. The only real assumption here is that if an OS supports some notion of "pages" at all, then for any address for which the program has read/write access (which are the only kinds of addresses that can be sanely passed to free/realloc), the OS allows the same access to the entire page containing that address. In two decades we haven't seen an exception to that yet, right? It's hard to imagine a HW designer thinking "I know! Let's piss away more transistors on finer-grained control nobody has asked for, and slow down memory operations even more checking for that." ;-)
- IMHO, the radix tree code is a bit simpler than Tim's obmalloc-big-pool code.
Absolutely so. There's another way to look at this: if Vladimir Marangozov (obmalloc's original author) had used an arena radix tree from the start, would someone now get anywhere proposing a patch to change it to the current scheme? I'd expect a blizzard of -1 votes, starting with mine ;-)
... My feeling right now is that Tim's obmalloc-big-pool is the best design at this point. Using 8 KB or 16 KB pools seems to be better than 4 KB. The extra complexity added by Tim's change is not so nice. obmalloc is already extremely subtle and obmalloc-big-pool makes it more so.
Moving to bigger pools and bigger arenas are pretty much no-brainers for us, but unless pool size is increased there's no particular reason to pursue either approach - "ain't broke, don't fix". Larry Hastings started a "The untuned tunable parameter ARENA_SIZE" thread here about two years ago, where he got a blizzard of flak for merely suggesting that perhaps, after 20 years, it may be time to increase the 256 KiB arena size ;-)

Ah, BTW, there's another real advantage to the radix tree, although it's minor: less "quantization" loss in bigger pools. For example, we can only get 7 objects of the largest (512 bytes) size class now. Move my PR to 16 KiB pools and the pool gets 4 times bigger, but we can still only get 7*4 = 28 objects out of that pool. It's still "stealing" bytes out of every page. But the radix tree approach doesn't: only the pool header takes away from block space. So move yours to 16 KiB pools, and we could get 31 512-byte blocks out of the pool (3 more).
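The arithmetic behind those counts, as a quick sketch; the 48-byte pool header is an assumption for illustration (the real overhead depends on the build), but it reproduces the 7 / 28 / 31 figures above:

```
#include <stdio.h>

int main(void)
{
    const int block = 512;      /* largest small-object size class */
    const int page = 4096;      /* OS page size                    */
    const int pool = 16384;     /* proposed bigger pool            */
    const int header = 48;      /* assumed pool-header overhead    */

    /* Current 4 KiB pools: only the header is stolen from the one page. */
    int now = (page - header) / block;                 /* 7  */

    /* Per-page arena indices: each of the 4 pages still yields only 7. */
    int big_pool_page_scheme = 4 * now;                /* 28 */

    /* Radix tree: only the pool header is stolen from the whole pool. */
    int big_pool_radix = (pool - header) / block;      /* 31 */

    printf("%d %d %d\n", now, big_pool_page_scheme, big_pool_radix);
    return 0;
}
```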
On Sat, 15 Jun 2019 01:15:11 -0500 Tim Peters <tim.peters@gmail.com> wrote:
... My feeling right now is that Tim's obmalloc-big-pool is the best design at this point. Using 8 KB or 16 KB pools seems to be better than 4 KB. The extra complexity added by Tim's change is not so nice. obmalloc is already extremely subtle and obmalloc-big-pool makes it more so.
Moving to bigger pools and bigger arenas are pretty much no-brainers for us, [...]
Why "no-brainers"?

Bigger pools sound ok, but bigger arenas will make Python less likely to return memory to the system.

We should evaluate what problem we are trying to solve here, instead of staring at micro-benchmark numbers on an idle system. Micro-benchmarks don't tell you what happens on a loaded system with many processes, lots of I/O happening.

If the problem is the cost of mmap() and munmap() calls, then the solution more or less exists at the system level: jemalloc and other allocators use madvise() with MADV_FREE (see here: https://lwn.net/Articles/593564/). A possible design is a free list of arenas on which you use MADV_FREE to let the system know the memory *can* be reclaimed. When the free list overflows, call munmap() on extraneous arenas.

Regards

Antoine.
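A minimal sketch of the design Antoine outlines, assuming 1 MiB arenas and an arbitrary free-list cap; the names and constants are illustrative, error handling and integration with obmalloc are omitted, and MADV_FREE needs a reasonably recent kernel (MADV_DONTNEED is the older fallback):

```
#include <sys/mman.h>
#include <stddef.h>

#define ARENA_SIZE     (1024 * 1024)    /* example: 1 MiB arenas          */
#define FREE_LIST_MAX  16               /* arbitrary cap for the sketch   */

static void *free_arenas[FREE_LIST_MAX];
static int   n_free;

static void
release_arena(void *base)
{
    if (n_free < FREE_LIST_MAX) {
        /* Keep the reservation, but let the kernel take the physical
           pages back if there is memory pressure. */
#ifdef MADV_FREE
        madvise(base, ARENA_SIZE, MADV_FREE);
#else
        madvise(base, ARENA_SIZE, MADV_DONTNEED);
#endif
        free_arenas[n_free++] = base;
    }
    else {
        /* Free list is full: actually give the address space back. */
        munmap(base, ARENA_SIZE);
    }
}

static void *
grab_arena(void)
{
    /* Reuse a cached arena if possible.  Its pages may have been
       reclaimed in the meantime, so treat the contents as garbage. */
    if (n_free > 0)
        return free_arenas[--n_free];
    return mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

The point of the design is that the kernel can reclaim the physical pages under pressure while the process keeps the cheap address-space reservation, so reusing an arena avoids a fresh mmap().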
[Tim. to Neil]
Moving to bigger pools and bigger arenas are pretty much no-brainers for us, [...]
[Antoine]
Why "no-brainers"?
We're running tests, benchmarks, the Python programs we always run, Python programs that are important to us, staring at obmalloc stats ... and seeing nothing bad, nothing mysterious, only neutral-to-good results. So what's to think about? ;-) These are 64-bit boxes, with terabytes of virtual address space. Asking for a meg of that is just reserving less than a thousandth of a thousandth of that range of integers, not actual physical RAM.
Bigger pools sound ok,
They're not necessarily independent choices. Increase pool size without increasing arena size, and the number of pools per arena falls. At the extreme, if pools were made the same size as arenas, we'd need at least 32 arenas just to start Python, which uses at least one pool of _every_ possible size class before you hit the prompt. (Note that, on 64-bit boxes, the number of possible size classes is falling from 64 in 3.7 to 32 in 3.8, due to a desire to make everything aligned to 16 bytes - which I've already seen account for some programs needing hundreds of MB more RAM.)
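For the record, that size-class count falls straight out of the alignment step and the 512-byte small-object threshold; a trivial check in the same sketch style as above:

```
#include <stdio.h>

int main(void)
{
    const int threshold = 512;   /* SMALL_REQUEST_THRESHOLD */
    /* One size class per alignment step up to the threshold. */
    printf("8-byte alignment (3.7):  %d classes\n", threshold / 8);   /* 64 */
    printf("16-byte alignment (3.8): %d classes\n", threshold / 16);  /* 32 */
    return 0;
}
```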
but bigger arenas will make Python less likely to return memory to the system.
I know that gets repeated a lot, and I usually play along - but why do people believe that? Where's the evidence?

At the start, obmalloc never returned arenas to the system. The vast majority of users were fine with that. A relative few weren't. Evan Jones wrote all the (considerable!) code to change that, and I massaged it and checked it in - not because there was "scientific proof" that it was more beneficial than harmful (it certainly added new expenses!) overall, but because it seemed like a right thing to do, _anticipating_ that the issue would become more important in coming years.

I'm still glad it was done, but no tests were checked in to _quantify_ its presumed benefits - or even to verify that it ever returned arenas to the system. Best I can tell, nobody actually has any informed idea how well it does. Evan stared at programs that were important to him, and fiddled things until he was "happy enough". Not everyone was.

About five years ago, Kristján Valur Jónsson opened this report:

    https://bugs.python.org/issue21220

suggesting a very different heuristic to try to free arenas. The "memcrunch.py" in his patch is the only time I've ever seen anyone write code trying to measure whether obmalloc's arena-freeing is effective. I can verify that if you increase the number of objects in his script by a factor of 100, my PR _never_ returns an arena to the system. But it's clear as mud to me whether his heuristic would, either (with the 100x smaller number of objects in the original script, the PR branch does recycle arenas). So that's the only objective evidence I have :-)

I've looked at obmalloc stats in other programs at various stages, and saw nothing concerning. memcrunch.py appears to model object lifetimes as coming from a uniform distribution, but in real life they appear to be strongly multi-modal (with high peaks at the "way less than an eye blink" and "effectively immortal" ends).
We should evaluate what problem we are trying to solve here,
I'm not trying to solve a problem. This is a "right thing to do" thing, anticipating that slinging a massive number of objects on massive machines will become ever more important, and that changing 20-year-old constants will allow obmalloc to spend more time in its fastest paths instead of its slowest.

We haven't been especially pro-active about giant machines, and are suffering from it:

    https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters...

"""
Update: Here is a "fun" Python performance bug that I ran into the other day: deleting a set of 1 billion strings takes >12 hours. Obviously, this particular instance can be fixed, but this exactly the sort of thing that I would never have done a few years ago. A billion strings seemed like a lot back then, but now we regularly discuss multiple Terabytes of input data as "not a big deal". This may not apply for your settings, but it does for mine.
"""

That was _probably_ due to obmalloc's move-one-at-a-time way of keeping its usable arenas list sorted, which sat un-diagnosed for over a year:

    https://bugs.python.org/issue32846

Fixing the underlying cause put giant machines on my radar, and getting rid of obmalloc's pool size limit was the next obvious thing that would help them (although not in the same universe as cutting quadratic time to linear).
instead of staring at micro-benchmark numbers on an idle system.
My only interest in those is that they're not slowing down, because that's important too. The aim here is much more to make life better for programs slinging millions - even billions - of objects. obmalloc's internal overheads are frugal, but not free. For example, it has to allocate at least 56 bytes of separate bookkeeping info for each arena. Nobody cares when they have 100 arenas, but when there are a million arenas (which I've seen), that adds up.
Micro-benchmarks don't tell you what happens on a loaded system with many processes, lots of I/O happening.
While running a loaded system with many processes and lots of I/O doesn't tell you what happens in micro-benchmarks ;-) A difference is that performance of micro-benchmarks can be more-than-less reliably measured, while nobody can do better than guess about how changes affect a loaded system. Changing address space reservations from less than trivial to still less than trivial just isn't a _plausible_ source of disaster, at least not to my eyes.
If the problem is the cost of mmap() and munmap() calls, then the solution more or less exists at the system level: jemalloc and other allocators use madvise() with MADV_FREE (see here: https://lwn.net/Articles/593564/).
A possible design is a free list of arenas on which you use MADV_FREE to let the system know the memory *can* be reclaimed. When the free list overflows, call munmap() on extraneous arenas.
People can certainly pursue that if they like. I'm not interested in adding more complication that helps only one of obmalloc's slowest paths on only one platform. Increasing pool and arena sizes targets all the slower paths on all platforms.

The dead obvious, dead simple, way to reduce mmap() expense is to call it less often, which just requires changing a compile-time constant - which will also call VirtualAlloc() equally less often on Windows.

The "real" problem you seem to care about is returning arenas to the system. I think more work should be done on that, but it's not the goal of _this_ PR. I'm not ignoring it, but so far I've seen no reason to be worried (not in my PR - but how the much larger arenas Neil generally uses fare in this respect is unknown to me).
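For anyone unfamiliar with obmalloc, the "compile-time constant" change being discussed is essentially of this shape (illustrative values; the exact spelling in Objects/obmalloc.c differs, and the real patches also adjust the dependent macros):

```
/* Illustrative: obmalloc sizes everything off two compile-time constants.
 * Today they are effectively 256 KiB arenas carved into 4 KiB, page-sized
 * pools; the proposals under discussion bump them to something like this. */
#define ARENA_SIZE   (1024 * 1024)      /* was (256 << 10), i.e. 256 KiB   */
#define POOL_SIZE    (16 * 1024)        /* was the 4 KiB system page size  */
```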
On 2019-06-15, Tim Peters wrote:
At the start, obmalloc never returned arenas to the system. The vast majority of users were fine with that.
Yeah, I was totally fine with that back in the day. However, I wonder now if there is a stronger reason to try to free memory back to the OS. Years ago, people would typically have swap space that was as large or larger than their real RAM. So, if the OS had to swap out unused pages, it wasn't a big deal. Now that disks are relatively so much slower and RAM is larger, people don't have as much swap. Some Linux systems get set up without any. Freeing arenas seems more important than it used to be.

OTOH, I don't think obmalloc should try too hard. The whole point of the small object allocator is to be really fast. Anti-fragmentation heuristics are going to slow it down. As far as I'm concerned, it works well enough as it is.
[Tim]
At the start, obmalloc never returned arenas to the system. The vast majority of users were fine with that.
[Neil]
Yeah, I was totally fine with that back in the day. However, I wonder now if there is a stronger reason to try to free memory back to the OS. Years ago, people would typically have swap space that was as large or larger than their real RAM. So, if the OS had to swap out unused pages, it wasn't a big deal. Now that disks are relatively so much slower and RAM is larger, people don't have as much swap. Some Linux systems get set up without any. Freeing arenas seems more important than it used to be.
See my earlier "right thing to do ... anticipating" ;-) It was clear enough then that processor speeds were going to continue to increase much faster than disk speeds, although the _primary_ driver at the time was the then-small but increasing number of long-running Python programs running big (for the time) data-crunching problems that repeatedly went through stages of low and high memory need.
OTOH, I don't think obmalloc should try too hard. The whole point of the small object allocator is to be really fast. Anti-fragmentation heuristics are going to slow it down. As far as I'm concerned, it works well enough as it is.
Freeing arenas _already_ slowed it down. All the cruft to keep arenas sorted by number of free pools was far from free. Before that, free pools for a given size class were singly-linked together directly (regardless of which arenas they were in), and we just used the head of that list. Arenas weren't involved at all. Indeed, pools didn't even point back to their arenas. So major code and data changes were needed to implement the "take from the non-full pool in the most heavily populated arena" heuristic.

The other heuristic worth considering is "take from the non-full pool with the smallest arena address", which directly strives to let the higher-addressed arenas free up. Since they're both just poke-&-hope, it's hard to out-think them. But it's possible one would consistently work better than the other, and even _which_ one could change depending on arena and pool sizes. And it should be noted that "poke-&-hope" is the best that can be hoped for, since it's impossible to predict the lifetime of an object when space for it is requested, and we can't move the object after its address is returned to the caller.

More amusing: address_in_range() didn't exist either at first. Instead a pool header contained its own address and a "magic" constant. So, e.g.,

    if (pool->pooladdr != pool || pool->magic != POOL_MAGIC) {
        // This ain't ours!  Pass it to the system malloc family instead.
    }

That check was unlikely to believe that something which _had_ come from the system malloc really belonged to obmalloc, but it proved very easy to trick into making that error, even from pure Python code. Good times ;-)
On Sat, 15 Jun 2019 19:56:58 -0500 Tim Peters <tim.peters@gmail.com> wrote:
At the start, obmalloc never returned arenas to the system. The vast majority of users were fine with that. A relative few weren't. Evan Jones wrote all the (considerable!) code to change that, and I massaged it and checked it in - not because there was "scientific proof" that it was more beneficial than harmful (it certainly added new expenses!) overall, but because it seemed like a right thing to do, _anticipating_ that the issue would become more important in coming years.
I'm still glad it was done, but no tests were checked in to _quantify_ its presumed benefits - or even to verify that it ever returned arenas to the system. Best I can tell, nobody actually has any informed idea how well it does. Evan stared at programs that were important to him, and fiddled things until he was "happy enough".
We moved from malloc() to mmap() for allocating arenas because of user requests to release memory more deterministically: https://bugs.python.org/issue11849 And given the number of people who use Python for long-running processes nowadays, I'm sure that they would notice (and be annoyed) if Python did not release memory after memory consumption spikes.
I've looked at obmalloc stats in other programs at various stages, and saw nothing concerning. memcrunch.py appears to model object lifetimes as coming from a uniform distribution, but in real life they appear to be strongly multi-modal (with high peaks at the "way less than an eye blink" and "effectively immortal" ends).
I agree they will certainly be multi-modal, with the number of modes, their respective weights and their temporal distance widely dependent on use cases. (the fact that they're multi-modal is the reason why generational GC is useful, btw)
We haven't been especially pro-active about giant machines, and are suffering from it:
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters...
So you're definitely trying to solve a problem, right?
Fixing the underlying cause put giant machines on my radar, and getting rid of obmalloc's pool size limit was the next obvious thing that would help them (although not in the same universe as cutting quadratic time to linear).
"Not in the same universe", indeed. So the question becomes: does the improvement increasing the pool and arena size have a negative outcome on *other* use cases? Not everyone has giant machines. Actually a frequent usage model is to have many small VMs or containers on a medium-size machine.
For example, it has to allocate at least 56 bytes of separate bookkeeping info for each arena. Nobody cares when they have 100 arenas, but when there are a million arenas (which I've seen), that adds up.
In relative terms, assuming that arenas are 50% full on average (probably a pessimistic assumption?), that overhead is 0.08% of the arena memory used. What's the point of worrying about that?
If the problem is the cost of mmap() and munmap() calls, then the solution more or less exists at the system level: jemalloc and other allocators use madvise() with MADV_FREE (see here: https://lwn.net/Articles/593564/).
A possible design is a free list of arenas on which you use MADV_FREE to let the system know the memory *can* be reclaimed. When the free list overflows, call munmap() on extraneous arenas.
People can certainly pursue that if they like. I'm not interested in adding more complication that helps only one of obmalloc's slowest paths on only one platform.
MADV_FREE is available on multiple platforms (at least Linux, macOS, FreeBSD). Windows seems to offer similar facilities: https://devblogs.microsoft.com/oldnewthing/20170113-00/?p=95185
The dead obvious, dead simple, way to reduce mmap() expense is to call it less often, which just requires changing a compile-time constant - which will also call VirtualAlloc() equally less often on Windows.
That's assuming the dominating term in mmap() cost is O(1) rather than O(size). That's not a given. The system call cost is certainly O(1), but the cost of reserving and mapping HW pages, and zeroing them out is most certainly O(# pages). Regards Antoine.
[Antoine]
We moved from malloc() to mmap() for allocating arenas because of user requests to release memory more deterministically:
Which was a good change! As was using VirtualAlloc() on Windows. None of that is being disputed. The change under discussion isn't about mmap - mmap only incidentally gets sucked in here because it's part of _the_ very slowest obmalloc paths. Again, I'm aiming at _all_ of obmalloc's slower paths, on all platforms, at once. mmap isn't my focus. That doesn't preclude anyone who cares lots about mmap from adding more complication to cater specifically to it.
And given the number of people who use Python for long-running processes nowadays, I'm sure that they would notice (and be annoyed) if Python did not release memory after memory consumption spikes.
The PR changes nothing about the "release arenas" heuristic or implementation. There's just unquantified speculation that boosting arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less trivial 1 MiB, may be disastrous. The only evidence for that so far is repetition of the speculation ;-)
... We haven't been especially pro-active about giant machines, and are suffering from it:
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters...
So you're definitely trying to solve a problem, right?
By my definition of "a problem", no. I have no quantified goals or any criteria to say "and now it's solved". I have open-ended concerns about how Python will fare on giant machines slinging billions of objects, and want to make things better _before_ users are provoked to complain. Which they'll do anyway, of course. I'm not trying to change human nature ;-)
So the question becomes: does the improvement increasing the pool and arena size have a negative outcome on *other* use cases?
Which is why Neil & I have been running benchmarks, tests, Python programs we run all the time, Python programs we care about ... and actively soliciting feedback from people actually trying the code on programs _they_ care about.
Not everyone has giant machines. Actually a frequent usage model is to have many small VMs or containers on a medium-size machine.
Something I've never done, so am wholly unqualified to judge. I don't even know what "many", "small", or "medium-size" might mean in this context. And I don't have a giant machine either, but spent most of my professional career in the "supercomputer" business so at least understand how those kinds of people think ;-)
For example, it has to allocate at least 56 bytes of separate bookkeeping info for each arena. Nobody cares when they have 100 arenas, but when there are a million arenas (which I've seen), that adds up.
In relative terms, assuming that arenas are 50% full on average (probably a pessimistic assumption?), that overhead is 0.08% of the arena memory used. What's the point of worrying about that?
You're only looking at one cost. Those bytes aren't just address reservations, they consume actual physical RAM. The bookkeeping info is periodically read, and mutated, over time. In aggregate, that's more than enough bytes to blow away an entire L3 cache. The less of that stuff needs to be read and written (i.e., the fewer arenas there are), the less pressure that puts on the faster layers (caches) of the memory hierarchy.

That bookkeeping info is also immortal: all the bookkeeping arena_object structs live in a single contiguously allocated vector. It may move over time (via realloc()), but can never shrink, only grow. Asking the platform malloc/realloc for a 50 MB chunk sucks on the face of it :-( So while the 50 MB is first-order trivial when a million arenas are in use, if the program enters a new phase releasing almost all of it, leaving (say) only 10 arenas still active, the 50 MB is still there, effectively wasting 200 arenas' worth of "invisible" (to _debugmallocstats()) space forever.

About typical arena usage: I expect, but don't know, that 50% is quite pessimistic. It's a failure of project management (starting with me) that _every_ step in the "free arenas" evolution was driven by a user complaining about their program, and that nothing was ever checked in to even ensure their problem remained solved, let alone to help find out how effective arena-releasing is in an arbitrary program. We've been flying blind from the start, and remain blind.

That said, over the decades I've often looked at obmalloc stats, and have generally been pleasantly surprised at how much of allocated space is being put to good use. 80% seems more typical than 50% to me based on that. It's easy enough to contrive programs tending toward only 16 bytes in use per arena, but nobody has ever reported anything like that.

The aforementioned memcrunch.py program wasn't particularly contrived, but was added to a bug report to "prove" that obmalloc's arena releasing strategy was poor. Here are some stats from running that under my PR, but using 200 times the initial number of objects as the original script:

    n = 20000000  # number of things

At the end, with 1 MiB arenas and 16 KiB pools:

    3362 arenas * 1048576 bytes/arena  = 3,525,312,512
    # bytes in allocated blocks        = 1,968,233,888

With 256 KiB arenas and 4 KiB pools:

    13375 arenas * 262144 bytes/arena  = 3,506,176,000
    # bytes in allocated blocks        = 1,968,233,888

So even there, over 50% of arena space was in allocated blocks at the end. Total arena space remained essentially the same either way. However, with smaller arenas the peak memory use was _higher_. It did manage to release over 100 arenas (about 1% of the total ever allocated). With the larger arenas, none were ever released.

Either way, 32,921,097 objects remained in use at the end. _If_ the program had gone on to create another mass of new objects, the big-arena version was better prepared for it: it had 8,342,661 free blocks of the right size class ready to reuse, but the small-arena version had 2,951,103. That is, releasing address space isn't a pure win either - it only pays if it so happens that the space won't be needed soon again. If the space is so needed, releasing the space was a waste of time. There is no "hard science" in poke-&-hope ;-)
..... The dead obvious, dead simple, way to reduce mmap() expense is to call it less often, which just requires changing a compile-time constant - which will also call VirtualAlloc() equally less often on Windows.
That's assuming the dominating term in mmap() cost is O(1) rather than O(size). That's not a given. The system call cost is certainly O(1), but the cost of reserving and mapping HW pages, and zeroing them out is most certainly O(# pages).
Good point! I grossly overstated it. But since I wasn't focused on mmap to begin with, it doesn't change any of my motivations for wanting to boost pool and arena sizes.

BTW, anyone keen to complicate the mmap management should first take this recent change into account:

    https://bugs.python.org/issue37257

That appears to have killed off _the_ most overwhelmingly common cause of obmalloc counter-productively releasing an arena only to create a new one again milliseconds later. My branch, and Neil's, both contain that change, which makes it much harder to compare our branches' obmalloc arena stats with 3.7. It turns out that a whole lot of "released arenas" under 3.7 (and will still be so in 3.8) were due to that worse-than-useless arena thrashing.
[Tim]
... Here are some stats from running [memcrunch.py] under my PR, but using 200 times the initial number of objects as the original script:
n = 20000000 #number of things
At the end, with 1M arena and 16K pool:
    3362 arenas * 1048576 bytes/arena  = 3,525,312,512
    # bytes in allocated blocks        = 1,968,233,888

... With the larger arenas, none [arenas] were ever released.
...
BTW, anyone keen to complicate the mmap management should first take this recent change into account:
https://bugs.python.org/issue37257
That appears to have killed off _the_ most overwhelmingly common cause of obmalloc counter-productively releasing an arena only to create a new one again milliseconds later.
My branch, and Neil's, both contain that change, which makes it much harder to compare our branches' obmalloc arena stats with 3.7. It turns out that a whole lot of "released arenas" under 3.7 (and will still be so in 3.8) were due to that worse-than-useless arena thrashing.
To illustrate, I reverted that change in my PR and ran exactly the same thing. Wow - _then_ the 1M-arena-16K-pool PR reclaimed 1135(!) arenas instead of none. Almost all worse than uselessly. The only one that "paid" was the last: the run ended with 3361 arenas still in use instead of 3362. Because with the change, one entirely empty arena remained on the usable_arenas list.

So, word to the wise: when looking at _debugmallocstats() output, like:

    # arenas allocated total           =  4,496
    # arenas reclaimed                 =  1,135
    # arenas highwater mark            =  3,362
    # arenas allocated current         =  3,361
    3361 arenas * 1048576 bytes/arena  =  3,524,263,936

the number "reclaimed" isn't really telling us much: before 3.9, it may be telling us only how many times obmalloc wasted time on useless arena thrashing. The _important_ bit is the difference between "highwater mark" and "allocated current". That's how much peak arena address reservation declined. In this run, it only managed to release one empty arena from the peak (which the actual PR does not release, because bpo-37257 changed this to keep at most one empty arena available for reuse).
On Mon, 17 Jun 2019 13:44:29 -0500 Tim Peters <tim.peters@gmail.com> wrote:
To illustrate, I reverted that change in my PR and ran exactly the same thing. Wow - _then_ the 1M-arena-16K-pool PR reclaimed 1135(!) arenas instead of none. Almost all worse than uselessly. The only one that "paid" was the last: the run ended with 3361 arenas still in use instead of 3362. Because with the change, one entirely empty arena remained on the usable_arenas list.
So, word to the wise: when looking at _debugmallocstats() output, like:
    # arenas allocated total           =  4,496
    # arenas reclaimed                 =  1,135
    # arenas highwater mark            =  3,362
    # arenas allocated current         =  3,361
    3361 arenas * 1048576 bytes/arena  =  3,524,263,936
the number "reclaimed" isn't really telling us much: before 3.9, it may be telling us only how many times obmalloc wasted time on useless arena thrashing.
That's a very nice improvement :-) Regards Antoine.
On 2019-06-15, Antoine Pitrou wrote:
We should evaluate what problem we are trying to solve here, instead of staring at micro-benchmark numbers on an idle system.
I think a change to obmalloc is not going to get accepted unless we can show it doesn't hurt these micro-benchmarks. To displace the status quo, it has to give other advantages as well.

I don't have any agenda or "problem to solve". After Tim made a PR to allow obmalloc to use larger pools, I thought it would be interesting to see if an arena mapping scheme based on radix trees could be performance competitive. I'm not proposing any changes to CPython at this point. I'm sharing the results of an experiment. I thought it was interesting. I guess you don't.
On Sat, 15 Jun 2019 22:02:35 -0600 Neil Schemenauer <nas-python@arctrix.com> wrote:
On 2019-06-15, Antoine Pitrou wrote:
We should evaluate what problem we are trying to solve here, instead of staring at micro-benchmark numbers on an idle system.
I think a change to obmalloc is not going to get accepted unless we can show it doesn't hurt these micro-benchmarks. To displace the status quo, it has to give other advantages as well.

I don't have any agenda or "problem to solve". After Tim made a PR to allow obmalloc to use larger pools, I thought it would be interesting to see if an arena mapping scheme based on radix trees could be performance competitive. I'm not proposing any changes to CPython at this point. I'm sharing the results of an experiment. I thought it was interesting. I guess you don't.
I'm not saying it's not interesting. I'm just saying that you can't validate memory allocator changes only on short-running micro-benchmarks with well-behaved allocation patterns. Regards Antoine.
obmalloc is very nice at allocating small (~224 bytes) memory blocks. But the current SMALL_REQUEST_THRESHOLD (512) seems too large to me.

```
>>> pool_size = 4096 - 48  # 48 is pool header size
>>> for bs in range(16, 513, 16):
...     n,r = pool_size//bs, pool_size%bs + 48
...     print(bs, n, r, 100*r/4096)
...
16 253 48 1.171875
32 126 64 1.5625
48 84 64 1.5625
64 63 64 1.5625
80 50 96 2.34375
96 42 64 1.5625
112 36 64 1.5625
128 31 128 3.125
144 28 64 1.5625
160 25 96 2.34375
176 23 48 1.171875
192 21 64 1.5625
208 19 144 3.515625
224 18 64 1.5625
240 16 256 6.25
256 15 256 6.25
272 14 288 7.03125
288 14 64 1.5625
304 13 144 3.515625
320 12 256 6.25
336 12 64 1.5625
352 11 224 5.46875
368 11 48 1.171875
384 10 256 6.25
400 10 96 2.34375
416 9 352 8.59375
432 9 208 5.078125
448 9 64 1.5625
464 8 384 9.375
480 8 256 6.25
496 8 128 3.125
512 7 512 12.5
```
There are two problems here.
First, pool overhead is at most about 3.5% for size classes up to 224 bytes.
But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
bytes, and 12.5% at 512 bytes.
Second, some size classes have the same number of memory blocks per pool:
classes 272 and 288 both hold 14 blocks, and 320 and 336 both hold 12.
It reduces the utilization of pools. This problem becomes bigger on
32-bit platforms.
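To make the second problem concrete, here is a small sketch (assuming the same 4 KiB pool and 48-byte header used above, and the 16-byte size-class step of 64-bit builds) that groups size classes by how many blocks fit in a pool; the smaller class in each group packs no more objects per pool than the larger one:

```
from collections import defaultdict

POOL_SIZE = 4096
POOL_HEADER = 48    # pool header size, as above
STEP = 16           # size-class step on 64-bit builds

by_count = defaultdict(list)
for bs in range(STEP, 513, STEP):
    by_count[(POOL_SIZE - POOL_HEADER) // bs].append(bs)

# Print every group of size classes sharing a blocks-per-pool count.
for nblocks, classes in sorted(by_count.items()):
    if len(classes) > 1:
        print(nblocks, "blocks/pool:", classes)
```

With the 16 KiB pool shown further below, every size class ends up with a distinct block count.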
Increasing pool size is one obvious way to fix these problems.
I think a 16 KiB pool size and a 2 MiB arena size (the huge page size on
x86) is a sweet spot for recent web servers (typically about 32 threads
and 64 GiB), but I have no evidence for it.
We need a reference application and scenario to benchmark.
pyperformance is not good for measuring memory usage of complex
applications.
>>> header_size = 48
>>> pool_size = 16*1024
>>> for bs in range(16, 513, 16):
...     n = (pool_size - header_size) // bs
...     r = (pool_size - header_size) % bs + header_size
...     print(bs, n, r, 100 * r / pool_size)
...
16 1021 48 0.29296875
32 510 64 0.390625
48 340 64 0.390625
64 255 64 0.390625
80 204 64 0.390625
96 170 64 0.390625
112 145 144 0.87890625
128 127 128 0.78125
144 113 112 0.68359375
160 102 64 0.390625
176 92 192 1.171875
192 85 64 0.390625
208 78 160 0.9765625
224 72 256 1.5625
240 68 64 0.390625
256 63 256 1.5625
272 60 64 0.390625
288 56 256 1.5625
304 53 272 1.66015625
320 51 64 0.390625
336 48 256 1.5625
352 46 192 1.171875
368 44 192 1.171875
384 42 256 1.5625
400 40 384 2.34375
416 39 160 0.9765625
432 37 400 2.44140625
448 36 256 1.5625
464 35 144 0.87890625
480 34 64 0.390625
496 32 512 3.125
512 31 512 3.125
Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
to 256 and trusting malloc to work well for medium-size memory blocks.
--
Inada Naoki <songofacandy@gmail.com>
[Inada Naoki <songofacandy@gmail.com>]
obmalloc is very nice at allocating small (~224 bytes) memory blocks. But the current SMALL_REQUEST_THRESHOLD (512) seems too large to me.
For the "unavoidable memory waste" reasons you spell out here, Vladimir deliberately set the threshold to 256 at the start. As things turned out, though, dicts drove the decision to boost it to 512 anyway. A comment in the code says: * Note: a size threshold of 512 guarantees that newly created dictionaries * will be allocated from preallocated memory pools on 64-bit. That was probably before "compact dicts", but, as-is, obmalloc can handle small dicts now even if they're not empty. And small dicts are common. Another relevant one pushing dicts to use obmalloc: https://bugs.python.org/issue23601
... Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD to 256 and trusting malloc to work well for medium-size memory blocks.
Alas, it doesn't ;-) Well, everyone who ever timed it found obmalloc was significantly faster than the system malloc for every size class obmalloc handles. That's partly because obmalloc runs under protection of the GIL, while the system malloc has to worry about thread safety. But, regardless, obmalloc is just plain hard to beat so long as it can stay in its fast paths (which increasing pool and arena sizes aims at helping it do).

Its current pool and arena sizes are tiny compared to comparable structures in the high-profile allocators often mentioned in these discussions. However, the latter generally use platform-specific ways to spell "free this OS page" rather than waiting for an entire arena-like structure to become empty.

For example, "SuperMalloc"[1] effectively uses 2 MiB pools, in the sense of "want 16 bytes? Cool, we'll start by devoting 2 MiB to holding _only_ 16-byte objects. Now you want 388? Cool, we'll reserve another 2 MiB to hold only 388-byte objects. Etc."

[1] http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf
There are two problems here. First, pool overhead is at most about 3.5% for size classes up to 224 bytes. But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464 bytes, and 12.5% at 512 bytes.
Note that in sys._debugmallocstats() output, the total of the "unusable for this reason" space is called "bytes lost to quantization". It's rarely a big deal in my experience, but it certainly tends to account for a higher percentage of arena space as the average size of used objects increases.
Second, some size classes have the same number of memory blocks per pool: classes 272 and 288 both hold 14 blocks, and 320 and 336 both hold 12. It reduces the utilization of pools. This problem becomes bigger on 32-bit platforms.
I don't understand the "32-bit platform" comment. On 32-bit boxes, alignment is to 8-byte boundaries (not 16 as above), and there are 64 size classes (not 32 as above). Which tends to make space efficiency better, not worse, than on 64-bit Python (16-byte alignment is a waste for the vast majority of objects Python programs use, which rarely require better than pointer alignment (on 64-bit boxes) or double alignment (on 32-bit ones)).
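A rough sketch of that alignment point, looking only at the rounding-up component of waste (not pool-level quantization) and assuming, unrealistically, that request sizes are spread evenly from 1 to 512 bytes:

```
def avg_rounding_waste(alignment, max_request=512):
    # Bytes added to round each request up to the next multiple of `alignment`.
    waste = [(-n) % alignment for n in range(1, max_request + 1)]
    return sum(waste) / len(waste)

for alignment in (8, 16):
    # Prints 3.5 for 8-byte alignment and 7.5 for 16-byte alignment.
    print(alignment, "byte alignment:", avg_rounding_waste(alignment),
          "bytes wasted per request on average")
```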
Increasing pool size is one obvious way to fix these problems.
Yes & no ;-) In Neil's radix-tree branch your code above works fine when the pool is increased. But my PR doesn't introduce a new tree, building on what obmalloc already does: it absolutely needs to "steal" 4 bytes at the start of every 4 KiB page - which turns into 16 bytes on 64-boxes to preserve 16-bit alignment. The number of usable blocks in a pool is computed by this function in the PR: /* Return total number of blocks in pool of size index `i`, as a uint. */ static uint NUMBLOCKS(int i) { assert(0 <= i && i < NB_SMALL_SIZE_CLASSES); /* The first page burns space for a pool header, and remaining pages * burn ALIGNMENT bytes for the arena index. */ const uint size = INDEX2SIZE(i); uint usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD; uint usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT; return usable1 / size + (usable2 / size) * (PAGES_PER_POOL - 1); } For a 16-KiB pool, this allows us to (e.g.) squeeze out 6 more 16-byte objects than we can get from 4 4-KiB pools, but (e.g.) doesn't help at all for 512-byte objects.
I think a 16 KiB pool size and a 2 MiB arena size (the huge page size on x86) is a sweet spot for recent web servers (typically about 32 threads and 64 GiB), but I have no evidence for it.
We need a reference application and scenario to benchmark. pyperformance is not good for measuring memory usage of complex applications.
On the other hand, when it comes to complex applications we're always flying blind ;-)
On Mon, 17 Jun 2019 11:15:29 +0900 Inada Naoki <songofacandy@gmail.com> wrote:
Increasing pool size is one obvious way to fix these problems. I think a 16 KiB pool size and a 2 MiB arena size (the huge page size on x86) is a sweet spot for recent web servers (typically about 32 threads and 64 GiB), but I have no evidence for it.
Note that the OS won't give a huge page automatically, because memory management becomes much more inflexible then.

For example, the Linux madvise() man page has this to say about MADV_HUGEPAGE:

    This feature is primarily aimed at applications that use large
    mappings of data and access large regions of that memory at a
    time (e.g., virtualization systems such as QEMU). It can very
    easily waste memory (e.g., a 2 MB mapping that only ever accesses
    1 byte will result in 2 MB of wired memory instead of one 4 KB
    page). See the Linux kernel source file
    Documentation/vm/transhuge.txt for more details.

I'm not sure a small objects allocator falls into the right use case for huge pages.

Regards Antoine.
On Mon, Jun 17, 2019 at 5:18 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 Jun 2019 11:15:29 +0900 Inada Naoki <songofacandy@gmail.com> wrote:
Increasing pool size is one obvious way to fix these problems. I think a 16 KiB pool size and a 2 MiB arena size (the huge page size on x86) is a sweet spot for recent web servers (typically about 32 threads and 64 GiB), but I have no evidence for it.
Note that the OS won't give a huge page automatically, because memory management becomes much more inflexible then.
When we use contiguous 2 MB regions, Linux may use a huge page implicitly when Transparent Huge Pages are enabled. Then, if we munmap one page of the 2 MB, the kernel will split the huge page back into many small pages. I don't know whether that really happens in Python applications, but using a 2 MB arena would reduce the risk of that page-splitting performance penalty.

In web applications, it's common for one Python worker process to use 50-100+ MiB RSS. A 2 MB arena seems reasonable for those applications.

I am not proposing making this the default. I just meant this may be a reasonable configuration for many Python users who create medium-to-large web apps.

Regards,

-- 
Inada Naoki <songofacandy@gmail.com>
On 17/06/2019 at 10:55, Inada Naoki wrote:
On Mon, Jun 17, 2019 at 5:18 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 Jun 2019 11:15:29 +0900 Inada Naoki <songofacandy@gmail.com> wrote:
Increasing pool size is one obvious way to fix these problems. I think a 16 KiB pool size and a 2 MiB arena size (the huge page size on x86) is a sweet spot for recent web servers (typically about 32 threads and 64 GiB), but I have no evidence for it.
Note that the OS won't give a huge page automatically, because memory management becomes much more inflexible then.
When we use contiguous 2 MB regions, Linux may use a huge page implicitly when Transparent Huge Pages are enabled.
But it's not enabled by default... And there are reasons for that (see the manpage I quoted).
In web applications, it's common for one Python worker process to use 50-100+ MiB RSS. A 2 MB arena seems reasonable for those applications.
Perhaps, but what is the problem you are trying to solve? Do you have evidence that memory management of those 50-100 MB is costly? Perhaps those 50-100 MB are allocated at worker startup (module initialization structures, docstrings...) and only get deallocated at the end, so they aren't really a problem for arena allocation cost. Regards Antoine.
On Mon, Jun 17, 2019 at 6:14 PM Antoine Pitrou <antoine@python.org> wrote:
But it's not enabled by default... And there are reasons for that (see the manpage I quoted).
Uh, then if people want to use huge pages, they need to enable them system-wide, or we need to add madvise() calls in obmalloc.c.
In web applications, it's common for one Python worker process to use 50-100+ MiB RSS. A 2 MB arena seems reasonable for those applications.
Perhaps, but what is the problem you are trying to solve? Do you have evidence that memory management of those 50-100 MB is costly?
I just meant we may be able to utilize THP if we provide a large-arena option. In other words, *if* we provide a configure option to increase the arena & pool sizes, a 2 MB arena seems reasonable to me. That's all I wanted to say here. I didn't mean that utilizing THP is the main motivation for increasing the arena size.

People who want to use huge pages may have a problem that huge pages solve, but I don't have one.

Regards,

-- 
Inada Naoki <songofacandy@gmail.com>
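A minimal sketch of that madvise() idea, expressed with Python's mmap module rather than in obmalloc.c itself; it assumes Linux, Python 3.8+ (for mmap.madvise()), and THP enabled in "madvise" or "always" mode:

```
import mmap

ARENA_SIZE = 2 * 1024 * 1024   # hypothetical 2 MiB arena

# Reserve an anonymous private mapping for the arena.
arena = mmap.mmap(-1, ARENA_SIZE)

# Hint that the kernel may back this mapping with a transparent huge page.
# A real allocator would also have to ensure the mapping is 2 MiB-aligned,
# or the kernel cannot use a huge page for it.
if hasattr(mmap, "MADV_HUGEPAGE"):
    arena.madvise(mmap.MADV_HUGEPAGE)
```

In obmalloc.c the equivalent would be a madvise(addr, ARENA_SIZE, MADV_HUGEPAGE) call right after the arena's mmap(), with the same alignment caveat.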
[Inada Naoki]
Increasing pool size is one obvious way to fix these problems. I think a 16 KiB pool size and a 2 MiB arena size (the huge page size on x86) is a sweet spot for recent web servers (typically about 32 threads and 64 GiB), but I have no evidence for it.
[Antoine]
Note that the OS won't give a huge page automatically, because memory management becomes much more inflexible then.
For example, the Linux madvise() man page has this to say about MADV_HUGEPAGE:
    This feature is primarily aimed at applications that use large
    mappings of data and access large regions of that memory at a
    time (e.g., virtualization systems such as QEMU). It can very
    easily waste memory (e.g., a 2 MB mapping that only ever accesses
    1 byte will result in 2 MB of wired memory instead of one 4 KB
    page). See the Linux kernel source file
    Documentation/vm/transhuge.txt for more details.
I'm not sure a small objects allocator falls into the right use case for huge pages.
The SuperMalloc paper I recently pointed at notes that it uses huge pages only for "huge" requests. Not for "small", "medium", or "large" requests. But it carves up 2 MiB chunks, aligned at 2 MiB addresses, for each size class anyway (which use 4K pages).

There is a mix of reasons for that. Partly for the same reasons I want bigger pools and arenas: to stay in the fastest code paths. Hitting page/arena/chunk boundaries costs cycles for computation and conditional branches, and clobbers cache lines to access & mutate bookkeeping info that the fast paths don't touch. Also to reduce the fraction of allocator space "wasted" on bookkeeping info: 48 header bytes out of a 4K pool is a bigger percentage hit than, say, two 4K pages (to hold fancier allocator bookkeeping data structures) out of a 2M chunk. And partly for the same reason Neil is keen for bigger arenas in his branch: to reduce the size of data structures that keep track of other bookkeeping info (in Neil's case, a radix tree, which can effectively shift away the lowest ARENA_BITS bits of the addresses it needs to store).

Which hints at much of why it wants "huge" chunks, but doesn't explain why it doesn't want huge pages except to satisfy huge requests. That's because it strives to be able to release physical RAM back to the system on a page basis (which is also part of why it needs fancier bookkeeping data structures to manage its chunks - it needs to keep track of which pages are in use, and apply page-based heuristics to push toward freeing pages).

So that combines very much larger "pools" (2M v 4K) with better chances of actually returning no-longer-used pages to the system (on a 4K basis rather than a 256K basis). But it's built on piles of platform-specific code, and isn't suitable at all for 32-bit boxes (it relies on virtual address space being an abundant resource on 64-bit boxes - reserving 2M of address space is close to trivial, and could potentially be done millions of times without getting into trouble).
Heh. I wasn't intending to be nasty, but this program makes our arena recycling look _much_ worse than memcrunch.py does. It cycles through phases. In each phase, it first creates a large randomish number of objects, then deletes half of all objects in existence. Except that every 10th phase, it deletes 90% instead. It's written to go through 100 phases, but I killed it after 10 because it was obviously going to keep on growing without bound.

Note 1: to do anything deterministic with obmalloc stats these days appears to require setting the envar PYTHONHASHSEED to 0 before running (else stats vary even by the time you get to an interactive prompt).

Note 2: there are 3 heavily used size classes here, for ints, 2-tuples, and class instances, of byte sizes 32, 64, and 96 on 64-bit boxes, under my PR and under released 3.7.3.

First with my branch, after phase 10 finishes building objects:

phase 10 adding 9953410
phase 10 has 16743920 objects
# arenas allocated total = 3,114
# arenas reclaimed = 0
# arenas highwater mark = 3,114
# arenas allocated current = 3,114
3114 arenas * 1048576 bytes/arena = 3,265,265,664
# bytes in allocated blocks = 3,216,332,784

No arenas have ever been reclaimed, but space utilization is excellent (about 98.5% of arenas are being used by objects).

Then after phase 10 deletes 90% of everything still alive:

phase 10 deleting factor 90% 15069528
phase 10 done deleting
# arenas allocated total = 3,114
# arenas reclaimed = 0
# arenas highwater mark = 3,114
# arenas allocated current = 3,114
3114 arenas * 1048576 bytes/arena = 3,265,265,664
# bytes in allocated blocks = 323,111,488

Still no arenas have been released, and space utilization is horrid. A bit less than 10% of allocated space is being used for objects.

Now under 3.7.3. First when phase 10 is done building:

phase 10 adding 9953410
phase 10 has 16743920 objects
# arenas allocated total = 14,485
# arenas reclaimed = 2,020
# arenas highwater mark = 12,465
# arenas allocated current = 12,465
12465 arenas * 262144 bytes/arena = 3,267,624,960
# bytes in allocated blocks = 3,216,219,656

Space utilization is again excellent. A significant number of arenas were reclaimed - but usefully? Let's see how things turn out after phase 10 ends deleting 90% of the objects:

phase 10 deleting factor 90% 15069528
phase 10 done deleting
# arenas allocated total = 14,485
# arenas reclaimed = 2,020
# arenas highwater mark = 12,465
# arenas allocated current = 12,465
12465 arenas * 262144 bytes/arena = 3,267,624,960
# bytes in allocated blocks = 322,998,360

Didn't manage to reclaim anything! Space utilization is again horrid, and it's actually consuming a bit more arena bytes than when running under the PR.

Which is just more of what I've been seeing over & over: 3.7.3 and the PR both do a fine job of recycling arenas, or a horrid job, depending on the program.

For excellent recycling, change this program to use a dict instead of a set. So data = {} at the start, fill it with data[serial] = Stuff(), and change data.pop() to use .popitem(). The difference is that set elements still appear in pseudo-random order, but dicts are in insertion-time order. So data.popitem() loses the most recently added dict entry, and the program is then just modeling stack allocation/deallocation.
def doit():
    import random
    from random import randrange
    import sys

    class Stuff:
        # add cruft so it takes 96 bytes under 3.7 and 3.8
        __slots__ = tuple("abcdefg")

        def __hash__(self):
            return hash(id(self))

    LO = 5_000_000
    HI = LO * 2
    data = set()
    serial = 0
    random.seed(42)
    for phase in range(1, 101):
        toadd = randrange(LO, HI)
        print("phase", phase, "adding", toadd)
        for _ in range(toadd):
            data.add((serial, Stuff()))
            serial += 1
        print("phase", phase, "has", len(data), "objects")
        sys._debugmallocstats()
        factor = 0.5 if phase % 10 else 0.9
        todelete = int(len(data) * factor)
        print(f"phase {phase} deleting factor {factor:.0%} {todelete}")
        for _ in range(todelete):
            data.pop()
        print("phase", phase, "done deleting")
        sys._debugmallocstats()

doit()
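The dict variant Tim describes would look something like the following (a reconstruction of the changes he sketches, not code he posted); popitem() removes the most recently inserted key, so deletions come off the top in stack order:

```
def doit_dict():
    import random
    from random import randrange
    import sys

    class Stuff:
        # same cruft as above; __hash__ is no longer needed since the
        # objects are dict values keyed by an int, not set elements
        __slots__ = tuple("abcdefg")

    LO = 5_000_000
    HI = LO * 2
    data = {}                       # dict instead of set
    serial = 0
    random.seed(42)
    for phase in range(1, 101):
        toadd = randrange(LO, HI)
        for _ in range(toadd):
            data[serial] = Stuff()  # instead of data.add((serial, Stuff()))
            serial += 1
        print("phase", phase, "has", len(data), "objects")
        sys._debugmallocstats()
        factor = 0.5 if phase % 10 else 0.9
        for _ in range(int(len(data) * factor)):
            data.popitem()          # drops the most recently added entry
        sys._debugmallocstats()

doit_dict()
```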
[Tim]
... Now under 3.7.3. First when phase 10 is done building:
phase 10 adding 9953410
phase 10 has 16743920 objects
# arenas allocated total = 14,485
# arenas reclaimed = 2,020
# arenas highwater mark = 12,465
# arenas allocated current = 12,465
12465 arenas * 262144 bytes/arena = 3,267,624,960
# bytes in allocated blocks = 3,216,219,656
Space utilization is again excellent. A significant number of arenas were reclaimed - but usefully?
Nope! Digging through all the output, all the arena recycling happened in the "_add_ millions of objects" stages, never in the "delete millions of objects" stages. So it was just more accidental arena thrashing (which was recently repaired in bpo-37257).
And one more random clue. The memcrunch.py attached to the earlier-mentioned bug report does benefit a lot from changing to a "use the arena with the smallest address" heuristic, leaving 86.6% of allocated bytes in use by objects at the end (this is with the arena-thrashing fix, and the current 256K/4K arena/pool sizes). Which isn't surprising, else the person who opened that report wouldn't have attached that specific script ;-)

However, when I increase its number of starting objects by a factor of 200, changing the heuristic barely helped at all. It managed to reclaim a few dozen more arenas by the end (out of 13,556 total arenas allocated), but left only 56.1% of arena space in use by objects.

For the later program I posted, which by accident-rather-than-design is even worse for arena cycling, changing the heuristic didn't help at all. Either way, no arenas were released (although one empty arena is retained in the usable_arenas list), and less than 10% of arena space was in use at the end of phase 10.

A few things to note, then:

- For truly effective RAM releasing, we would almost certainly need to make major changes, to release RAM at an OS page level. 256K arenas were already too fat a granularity.

- Whether a program benefits from what we do now appears to rely by far most on its patterns of allocation/deallocation, rather than on the arena-releasing heuristic OR on the arena size.

- While the by-lowest-address heuristic is obviously more effective in one known artificial program so far ;-) , it's more expensive than what we do now. That's a full-blown sorting problem. Ordering by number of free pools is not: that's essentially a bucket-distribution problem, since the number of _possible_ values is small. The data structure changes I already checked in truly take constant time, in all cases, to restore sorted order when an arena's number of free pools changes. When sorting by address, the best that can be done in general is to take O(log(A)) time to insert a new arena (where A is the number of arenas). And a linked list is no good for that either (unless, e.g., it's made fancier, like a skip list). The quick experiment I tried used one-at-a-time search to put an arena in the by-address-ordered list (the submitted patch also did), and we already know from experience that can become a timing disaster when hundreds of thousands of arenas are in use. OTOH, that only needs to be done once each time an arena is added to usable_arenas, not every time an arena's number of free pools changes.

So, overall:

- I'm not inclined to pursue changing the arena-freeing heuristic.

- Changes to keep track of OS page-level use would likely require starting over from scratch. But that would also make using far larger pools practical (on 64-bit boxes).

- I still haven't found a program where changing from 256K to 1M arenas makes a lick of appreciable difference in arena recycling effectiveness: for a given program chewing on a given problem size, "does pretty well" or "does horribly" has been the outcome either way.
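A toy sketch of that data-structure point (not obmalloc's actual code, which threads doubly linked lists through the arena structs); it only illustrates why one ordering is a bucket problem and the other a sorted-insertion problem:

```
import bisect
from collections import defaultdict

# Ordering by number of free pools: the count can only range over
# 0..pools_per_arena, so one bucket per possible count restores
# "sorted order" with a couple of list operations.
buckets = defaultdict(list)          # nfreepools -> arena ids

def nfreepools_changed(arena_id, old, new):
    buckets[old].remove(arena_id)    # obmalloc avoids even this scan by
    buckets[new].append(arena_id)    # keeping arenas on linked lists

# Ordering by lowest address: every new arena needs a sorted insertion,
# O(log A) to find the slot plus the cost of keeping the sequence sorted.
arenas_by_address = []

def arena_added(base_address):
    bisect.insort(arenas_by_address, base_address)
```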
[Tim]
- For truly effective RAM releasing, we would almost certainly need to make major changes, to release RAM at an OS page level. 256K arenas were already too fat a granularity.
We can approximate that closely right now by using 4K pools _and_ 4K arenas: one pool per arena, and mmap()/munmap() are then called on one page at a time. [Don't try this at home ;-) There are subtle assumptions in the code that there are at least two pools in an arena, and those have to be overcome first.]

For memcrunch.py, using 200x the original initial objects, this works quite well! Note that this still uses our current release-arenas heuristic: the only substantive change from the status quo is setting ARENA_SIZE to POOL_SIZE (both 4 KiB - one page).

# arenas allocated total = 873,034
# arenas reclaimed = 344,380
# arenas highwater mark = 867,540
# arenas allocated current = 528,654
528654 arenas * 4096 bytes/arena = 2,165,366,784

# bytes in allocated blocks = 1,968,234,336
# bytes in available blocks = 141,719,280
5349 unused pools * 4096 bytes = 21,909,504
# bytes lost to pool headers = 25,118,640
# bytes lost to quantization = 8,385,024
# bytes lost to arena alignment = 0
Total = 2,165,366,784

So, at the end, space utilization is over 90%:

1,968,234,336 / 2,165,366,784 = 0.90896117

OTOH, an even nastier version of the other program I posted isn't helped much at all, ending like so after phase 10:

# arenas allocated total = 1,025,106
# arenas reclaimed = 30,539
# arenas highwater mark = 1,025,098
# arenas allocated current = 994,567
994567 arenas * 4096 bytes/arena = 4,073,746,432

# bytes in allocated blocks = 232,861,440
# bytes in available blocks = 2,064,665,008
424741 unused pools * 4096 bytes = 1,739,739,136
# bytes lost to pool headers = 27,351,648
# bytes lost to quantization = 9,129,200
# bytes lost to arena alignment = 0
Total = 4,073,746,432

So space utilization is under 6%:

232,861,440 / 4,073,746,432 = 0.0571615

Believe it or not, that's slightly (but _only_ slightly) better than when using the current 256K/4K arena/pool mix, which released no arenas at all and ended with 232,861,440 / 4,199,022,592 = 0.05545611 utilization.

So:

- There's substantial room for improvement in releasing RAM by tracking it at OS page level.

- But the current code design is (very!) poorly suited for that.

- In some non-contrived cases it wouldn't really help anyway.

A natural question is how much arena size affects final space utilization for memcrunch.py. Every successive increase over one pool hurts, but eventually it stops mattering much.
Here are the possible power-of-2 arena sizes, using 4K pools, ending with the smallest for which no arenas get reclaimed:

1,968,234,336 / 2,165,366,784 = 0.90896117
528654 arenas * 4096 bytes/arena = 2,165,366,784
# bytes in allocated blocks = 1,968,234,336

1,968,234,336 / 2,265,399,296 = 0.86882447
276538 arenas * 8192 bytes/arena = 2,265,399,296
# bytes in allocated blocks = 1,968,234,336

1,968,235,360 / 2,441,314,304 = 0.80621957
149006 arenas * 16384 bytes/arena = 2,441,314,304
# bytes in allocated blocks = 1,968,235,360

1,968,235,360 / 2,623,799,296 = 0.75014707
80072 arenas * 32768 bytes/arena = 2,623,799,296
# bytes in allocated blocks = 1,968,235,360

1,968,235,360 / 2,924,216,320 = 0.67308131
44620 arenas * 65536 bytes/arena = 2,924,216,320
# bytes in allocated blocks = 1,968,235,360

1,968,235,360 / 3,299,475,456 = 0.59652978
25173 arenas * 131072 bytes/arena = 3,299,475,456
# bytes in allocated blocks = 1,968,235,360

1,968,235,360 / 3,505,913,856 = 0.56140437
13374 arenas * 262144 bytes/arena = 3,505,913,856
# bytes in allocated blocks = 1,968,235,360

1,968,235,360 / 3,552,051,200 = 0.55411233
6775 arenas * 524288 bytes/arena = 3,552,051,200
# bytes in allocated blocks = 1,968,235,360

1,968,235,360 / 3,553,624,064 = 0.55386707
3389 arenas * 1048576 bytes/arena = 3,553,624,064
# bytes in allocated blocks = 1,968,235,360

Most of the damage was done by the time we reached 128K arenas, and "almost all" when reaching 256K. I expect that's why I'm not seeing much of any effect (on arena recycling effectiveness) moving from the current 256K/4K to the PR's 1M/16K. 256K/4K already required "friendly" allocation/deallocation patterns for the status quo to do real good, and 256K already requires "friendly indeed" ;-)
[Tim]
- For truly effective RAM releasing, we would almost certainly need to make major changes, to release RAM at an OS page level. 256K arenas were already too fat a granularity.
[also Tim]
We can approximate that closely right now by using 4K pools _and_ 4K arenas: one pool per arena, and mmap()/munmap() are then called on one page at a time.
Sorry about this: there was another exceedingly subtle "there's more than one pool in an arena" assumption in the code, which didn't cause anything to fail, but did cause it to hold on to empty arenas in some cases. Repaired that, and then:
For memcrunch.py, using 200x the original initial objects, this works quite well! ... So, at the end, space utilization is over 90%:
1,968,234,336 / 2,165,366,784 = 0.90896117
Which changed to the slightly better:

1,968,235,360 / 2,143,465,472 = 0.91824916
OTOH, an even nastier version of the other program I posted isn't helped much at all, ending like so after phase 10: ... 232,861,440 / 4,073,746,432 = 0.0571615
And that one enjoyed a relatively huge improvement, to:

232,861,440 / 2,334,990,336 = 0.09972694

But none of that changes any of the bottom-level observations or speculations.