[Python-Dev] Dictionary tuning
Jeff Epler
jepler@unpythonic.net
Wed, 30 Apr 2003 11:16:16 -0500
On Wed, Apr 30, 2003 at 12:32:23AM -0400, Tim Peters wrote:
> If someone wants to experiment with that in lookdict_string(), stick a new
>
> ++i;
>
> before the for loop, and move the existing
>
> i = (i << 2) + i + perturb + 1;
>
> to the bottom of that loop. Likewise for lookdict().
You might also investigate making PyDictEntry a power-of-two bytes big
(it's currently 12 bytes) so that they align nicely in the cache, and
then use
i ^= 1;
instead of
++i;
so that the second key checked is always in the same (32-byte or bigger)
cache line. Of course, increasing the size of PyDictEntry would also
increase the size of all dicts by 33%, so the speed payoff would have
to be big.
It's also not obvious that ma_smalltable will be 32-byte aligned (since
no special effort was made, it's unlikely to be). If it's not, then
this optimization would still not pay (compared to ++i) for <= MINSIZE
dictionaries (which are the important case, aren't they?).
A little program indicates that the table has an 8-byte or better
alignment, the xor approach gives same-cache-line results more
frequently than the increment approach even with a 12-byte PyDictEntry.
This doesn't quite make sense to me. It also indicates that if the
entry is 16 bytes but the alignment is less than 32 bytes, xor is a
loss, which does make sense. The results (for a 32-byte cache line):
algorithm  sizeof()  alignment  % in same cache line
i^=1          12         4             62.5
i^=1          12         8             75.0
i^=1          12        16             75.0
i^=1          12        32             75.0
i^=1          16         4             50.0
i^=1          16         8             50.0
i^=1          16        16             50.0
i^=1          16        32            100.0
++i           12         4             62.5
++i           12         8             62.5
++i           12        16             62.5
++i           12        32             62.5
++i           16         4             50.0
++i           16         8             50.0
++i           16        16             50.0
++i           16        32             50.0
so using i^=1 and adding 4 bytes to each dict (if necessary) to get
8-alignment of ma_smalltable would give a 12.5% increase in the hit rate
of the second probe compared to i++.
Ouch. When I take into account that each probe accesses me_key (not
just me_hash) the results change:
algorithm  sizeof()  alignment  % in same cache line
i^=1          16         4             37.5
++i           16         4             37.5
i^=1          12        16             50.0
i^=1          12        32             50.0
i^=1          12         4             50.0
i^=1          12         8             50.0
i^=1          16        16             50.0
i^=1          16         8             50.0
++i           12        16             50.0
++i           12        32             50.0
++i           12         4             50.0
++i           12         8             50.0
++i           16        16             50.0
++i           16        32             50.0
++i           16         8             50.0
i^=1          16        32            100.0
You don't beat i++ unless you go to size 16 with alignment 32.
Looking at the # of cache lines accessed on average, the numbers are
unsurprising. For the 37.5% items, 1.625 cache lines are accessed for
the two probes, 1.5 for the 50% items, and 1.0 for the 100% items.
Looking at the number of cache lines accessed for a single probe,
8-or-better alignment gives 1.0 cache lines accessed for 16-byte
structures, and 1.125 for all other cases (4-byte alignment or 12-byte
structures).
If the "more than 3 probes" case bears optimizing (and I doubt it does),
the for(perturb) loop could be unrolled once, with even iterations using
++i or i^=1 and odd iterations using i = (i << 2) + i + perturb + 1;
so that the same-cache-line property is used as often as possible. Of
course, the code duplication of the rest of the loop body will increase
i-cache pressure a bit.
And I'm surprised if you read this far. Summary: i^=1 is not likely to
win compared to ++i, unless we increase dict size by 33%.
Jeff