
[me]
... In other words, the lastoffset acts as an upper bound watermark.
[Tim]
The comparison in the "try to extend the free list" code is executed very frequently, so probing a static vector this way will slow down the thing noticeably. I've tried that. We are very much running after speed here. [me]
I didn't want to do that optimization before, because it would have resulted in a bigger pool header and waste of space. Now it's ok.
[Tim]
Do both, despite that the multiplication optimization is more important. (for one division we have #blocks multiplications for a full pool) 1) I think that adding back one uint slot (lastoffset) is not a problem (and we end up with the perfect balance of having a pool header layout with 4 pointers and 4 integers <wink>). 2) Now, with the advent of lastoffset, we can easily get rid off the division as well. For this, we switch to offset calculus for figuring out whether the free list can be extended. This means that 'capacity' becomes 'highoffset'. The 'highoffset' is the highwater not to be exceeded by more than 'size' bytes by 'lastoffest'. a) the test becomes then: if (pool->lastoffset <= pool->highoffset) ... pool->lastoffset += size; pool->freeblock = (block *)pool + \ pool->lastoffset; b) in the pool_init code: pool->lastoffset = POOL_OVERHEAD + size; pool->highoffset = POOL_SIZE - size - size; and you'll need to adjust the code accordingly at the places where you're using 'capacity' per se (and adapt the comments, of course). This should pretty much settle these murky optimizations for now. Cheers, Vladimir

[Marangozov, Vladimir (Vladimir) -- I hope you're noticing that your name as reported by your email headers is getting stranger all the time <wink>]
Gotcha. I was left partly unhappy with the "bulletproof address check" for the same reason (needing to index into a static vector of arena base addresses).
Two things have shifted since you wrote this code: + The architecture trend has been to make division ever-slower in comparison to multiplication (e.g., Itanium doesn't even have an integer division instruction); on some boxes integer division is over 100 cycles now, at the same time integer multiplication is down to a few. + We can fit fewer objects into a pool now. For example, on Windows, Python container types are 16 bytes bigger than they used to be, due to gc-header overhead (and an empty dict is actually up to 148 bytes on Windows now!). We could shift that back by making pools larger, but if they exceed the size of a system page then cutting a legit pointer back to a pool-aligned address may yield an invalid address. So it gets dicey short of teaching Python how to ask the OS directly for pages.
Yes, that is perfect <wink>.
I'll do this, but I expect I'll use a nextoffset instead of a lastoffset, in order to let the two additions in 2a proceed in parallel, and to skip one of the subtracts in 2b. Hey, if we're gonna optimize, I'm not gonna miss a cycle.
This should pretty much settle these murky optimizations for now.
Cool! Thanks for your help. You should come back to Python, you know -- we don't know what you're doing instead, but we all have to suspect it's comparatively worthless <wink>.

[Marangozov, Vladimir (Vladimir) -- I hope you're noticing that your name as reported by your email headers is getting stranger all the time <wink>]
Gotcha. I was left partly unhappy with the "bulletproof address check" for the same reason (needing to index into a static vector of arena base addresses).
Two things have shifted since you wrote this code: + The architecture trend has been to make division ever-slower in comparison to multiplication (e.g., Itanium doesn't even have an integer division instruction); on some boxes integer division is over 100 cycles now, at the same time integer multiplication is down to a few. + We can fit fewer objects into a pool now. For example, on Windows, Python container types are 16 bytes bigger than they used to be, due to gc-header overhead (and an empty dict is actually up to 148 bytes on Windows now!). We could shift that back by making pools larger, but if they exceed the size of a system page then cutting a legit pointer back to a pool-aligned address may yield an invalid address. So it gets dicey short of teaching Python how to ask the OS directly for pages.
Yes, that is perfect <wink>.
I'll do this, but I expect I'll use a nextoffset instead of a lastoffset, in order to let the two additions in 2a proceed in parallel, and to skip one of the subtracts in 2b. Hey, if we're gonna optimize, I'm not gonna miss a cycle.
This should pretty much settle these murky optimizations for now.
Cool! Thanks for your help. You should come back to Python, you know -- we don't know what you're doing instead, but we all have to suspect it's comparatively worthless <wink>.
participants (2)
-
Marangozov, Vladimir (Vladimir)
-
Tim Peters