
[Tim]
- For truly effective RAM releasing, we would almost certainly need to
make major changes so that RAM is released at OS page level. 256K
arenas were already too fat a granularity.
We can approximate that closely right now by using 4K pools _and_ 4K arenas: one pool per arena, and mmap()/munmap() are then called on one page at a time.
[Don't try this at home ;-) There are subtle assumptions in the code that there are at least two pools in an arena, and those have to be overcome first.]
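To put the granularity point in rough numbers (a toy back-of-the-envelope sketch, not measurements from this post): an arena can only be returned to the OS once every pool in it is empty, so with 256K arenas and 4K pools a single live small object can pin an entire 64-pool arena, while with one-pool arenas the same heuristic works one page at a time.

    # Toy illustration only: how much RAM one surviving small object can pin
    # under the status-quo geometry vs. the one-pool-per-arena experiment.
    ARENA_SIZE = 256 * 1024          # status-quo arena size (bytes)
    POOL_SIZE = 4 * 1024             # one OS page per pool
    pools_per_arena = ARENA_SIZE // POOL_SIZE
    print(f"{pools_per_arena} pools per arena")                    # 64
    print(f"one live object pins {ARENA_SIZE // 1024} KiB")        # 256 KiB
    print(f"with 4K arenas it pins only {POOL_SIZE // 1024} KiB")  # 4 KiB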
For memcrunch.py, using 200x the original number of initial objects, this works quite well! Note that this still uses our current release-arenas heuristic: the only substantive change from the status quo is setting ARENA_SIZE to POOL_SIZE (both 4 KiB, one OS page).
# arenas allocated total = 873,034
# arenas reclaimed = 344,380
# arenas highwater mark = 867,540
# arenas allocated current = 528,654
528654 arenas * 4096 bytes/arena = 2,165,366,784

# bytes in allocated blocks = 1,968,234,336
# bytes in available blocks = 141,719,280
5349 unused pools * 4096 bytes = 21,909,504
# bytes lost to pool headers = 25,118,640
# bytes lost to quantization = 8,385,024
# bytes lost to arena alignment = 0
Total = 2,165,366,784
So, at the end, space utilization is over 90%:
1,968,234,336 / 2,165,366,784 = 0.90896117
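For reference, these counters have the same shape as the output of CPython's sys._debugmallocstats() (whether or not that is exactly how they were collected here). A minimal sketch of dumping them and recomputing the utilization figure above:

    import sys

    # Dump pymalloc's arena/pool/block counters to stderr (CPython-specific).
    sys._debugmallocstats()

    # Recompute the utilization figure quoted above:
    # "# bytes in allocated blocks" / ("# arenas allocated current" * arena size)
    allocated = 1_968_234_336
    arenas_current = 528_654
    arena_size = 4 * 1024            # the experimental one-page arenas
    print(allocated / (arenas_current * arena_size))   # ~0.909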
OTOH, an even nastier version of the other program I posted isn't helped much at all, ending like so after phase 10:
# arenas allocated total = 1,025,106
# arenas reclaimed = 30,539
# arenas highwater mark = 1,025,098
# arenas allocated current = 994,567
994567 arenas * 4096 bytes/arena = 4,073,746,432

# bytes in allocated blocks = 232,861,440
# bytes in available blocks = 2,064,665,008
424741 unused pools * 4096 bytes = 1,739,739,136
# bytes lost to pool headers = 27,351,648
# bytes lost to quantization = 9,129,200
# bytes lost to arena alignment = 0
Total = 4,073,746,432
So space utilization is under 6%:
232,861,440 / 4,073,746,432 = 0.0571615
Believe it or not, that's slightly (but _only_ slightly) better than when using the current 256K/4K arena/pool mix, which released no arenas at all and ended with
232,861,440 / 4,199,022,592 = 0.05545611
utilization.
So:
- There's substantial room for improvement in releasing RAM by tracking it at OS page level.
- But the current code design is (very!) poorly suited for that.
- In some non-contrived cases it wouldn't really help anyway.
A natural question is how much arena size affects final space utilization for memcrunch.py. Every successive increase over one pool hurts, but eventually it stops mattering much. Here are the results for each possible power-of-2 arena size, using 4K pools, ending with the smallest size at which no arenas get reclaimed (utilization = bytes in allocated blocks / total arena bytes):
bytes/arena  # arenas  total arena bytes  allocated bytes  utilization
       4096    528654      2,165,366,784    1,968,234,336   0.90896117
       8192    276538      2,265,399,296    1,968,234,336   0.86882447
      16384    149006      2,441,314,304    1,968,235,360   0.80621957
      32768     80072      2,623,799,296    1,968,235,360   0.75014707
      65536     44620      2,924,216,320    1,968,235,360   0.67308131
     131072     25173      3,299,475,456    1,968,235,360   0.59652978
     262144     13374      3,505,913,856    1,968,235,360   0.56140437
     524288      6775      3,552,051,200    1,968,235,360   0.55411233
    1048576      3389      3,553,624,064    1,968,235,360   0.55386707
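The utilization column is just "bytes in allocated blocks" divided by (arena count * arena size); a small sketch that reproduces a few of the rows above:

    # (bytes/arena, current arenas, bytes in allocated blocks), from the table above
    rows = [
        (4096,    528654, 1_968_234_336),
        (262144,   13374, 1_968_235_360),
        (1048576,   3389, 1_968_235_360),
    ]
    for arena_size, arenas, allocated in rows:
        total = arenas * arena_size
        print(f"{arena_size:>8} bytes/arena: {allocated / total:.8f}")
    # Matches the utilization column above (to rounding).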
Most of the damage was done by the time we reached 128K arenas, and "almost all" when reaching 256K.
I expect that's why I'm not seeing much of any effect (on arena recycling effectiveness) moving from the current 256K/4K to the PR's 1M/16K. 256K/4K already required "friendly" allocation/deallocation patterns for the status quo to do real good, and at 256K "friendly" already has to mean "friendly indeed" ;-)