[Python-Dev] pymalloc killer

Tim Peters tim.one@comcast.net
Thu, 04 Apr 2002 22:14:19 -0500


>> i.e., all  references to pool->capacity *could* be replaced by
>> capacity[pool->szidx],  where "capacity" is a new small (at most
>> 32 uints) file-static const vector computed at compile-time.

[Marangozov, Vladimir (Vladimir) -- I hope you're noticing that your
 name as reported by your email headers is getting stranger all the
 time <wink>]

> The comparison in the "try to extend the free list" code is executed
> very frequently, so probing a static vector this way will slow down the
> thing noticeably. I've tried that. We are very much running after speed
> here.

Gotcha.  I was left partly unhappy with the "bulletproof address check" for
the same reason (needing to index into a static vector of arena base
addresses).

> ...
[on "the division" and "the multiplication" optimizations]
> Do both, despite that the multiplication optimization is more important.
> (for one division we have #blocks multiplications for a full pool)

Two things have shifted since you wrote this code:

+ The architecture trend has been to make division ever-slower in
  comparison to multiplication (e.g., Itanium doesn't even have an
  integer division instruction); on some boxes integer division is over
  100 cycles now, at the same time integer multiplication is down to a
  few.

+ We can fit fewer objects into a pool now.  For example, on Windows,
  Python container types are 16 bytes bigger than they used to be, due
  to gc-header overhead (and an empty dict is actually up to 148 bytes
  on Windows now!).

We could shift that back by making pools larger, but if they exceed the size
of a system page then cutting a legit pointer back to a pool-aligned address
may yield an invalid address.  So it gets dicey short of teaching Python how
to ask the OS directly for pages.

> 1) I think that adding back one uint slot (lastoffset) is not a problem
>    (and we end up with the perfect balance of having a pool header layout
>     with 4 pointers and 4 integers <wink>).

Yes, that is perfect <wink>.

> 2) Now, with the advent of lastoffset, we can easily get rid off the
>    division as well. For this, we switch to offset calculus for figuring
>    out whether the free list can be extended. This means that 'capacity'
>    becomes 'highoffset'. The 'highoffset' is the highwater not to be
>    exceeded by more than 'size' bytes by 'lastoffest'.
>
>    a) the test becomes then:  if (pool->lastoffset <= pool->highoffset)
>                                   ...
>                                   pool->lastoffset += size;
>                                   pool->freeblock = (block *)pool + \
>                                                     pool->lastoffset;
>
>    b) in the pool_init code:  pool->lastoffset = POOL_OVERHEAD + size;
>                               pool->highoffset = POOL_SIZE - size - size;
>
>    and you'll need to adjust the code accordingly at the places where
>    you're using 'capacity' per se (and adapt the comments, of course).

I'll do this, but I expect I'll use a nextoffset instead of a lastoffset, in
order to let the two additions in 2a proceed in parallel, and to skip one of
the subtracts in 2b.  Hey, if we're gonna optimize, I'm not gonna miss a
cycle.

> This should pretty much settle these murky optimizations for now.

Cool!  Thanks for your help.  You should come back to Python, you know -- we
don't know what you're doing instead, but we all have to suspect it's
comparatively worthless <wink>.