[Python-Dev] Dictionary tuning
Jeff Epler
jepler@unpythonic.net
Wed, 30 Apr 2003 11:16:16 -0500
On Wed, Apr 30, 2003 at 12:32:23AM -0400, Tim Peters wrote:
> If someone wants to experiment with that in lookdict_string(), stick a new
>
> ++i;
>
> before the for loop, and move the existing
>
> i = (i << 2) + i + perturb + 1;
>
> to the bottom of that loop. Likewise for lookdict().
You might also investigate making PyDictEntry a power-of-two bytes big
(it's currently 12 bytes) so that they align nicely in the cache, and
then use
i ^= 1;
instead of
++i;
so that the second key checked is always in the same (32-byte or bigger)
cache line. Of course, increasing the size of PyDictEntry would also
increase the size of all dicts by 33%, so the speed payoff would have
to be big.
It's also not obvious that ma_smalltable will be 32-byte aligned (since
no special effort was made, it's unlikely to be). If it's not, then
this optimization would still not pay (compared to ++i) for <= MINSIZE
dictionaries (which are the important case, aren't they?).
A little program indicates that the table has an 8-byte or better
alignment, the xor approach gives same-cache-line results more
frequently than the increment approach even with a 12-byte PyDictEntry.
This doesn't quite make sense to me. It also indicates that if the
entry is 16 bytes but the alignment is less than 32 bytes, xor is a
loss, which does make sense. The results (for a 32-byte cache line):
algorithm  sizeof()  alignment  % in same cache line
i^=1          12         4             62.5
i^=1          12         8             75.0
i^=1          12        16             75.0
i^=1          12        32             75.0
i^=1          16         4             50.0
i^=1          16         8             50.0
i^=1          16        16             50.0
i^=1          16        32            100.0
++i           12         4             62.5
++i           12         8             62.5
++i           12        16             62.5
++i           12        32             62.5
++i           16         4             50.0
++i           16         8             50.0
++i           16        16             50.0
++i           16        32             50.0
so using i^=1 and adding 4 bytes to each dict (if necessary) to get
8-alignment of ma_smalltable would give a 12.5% increase in the hit rate
of the second probe compared to i++.
Ouch. When I take into account that each probe accesses me_key (not
just me_hash) the results change:
algorithm  sizeof()  alignment  % in same cache line
i^=1          16         4             37.5
++i           16         4             37.5
i^=1          12        16             50.0
i^=1          12        32             50.0
i^=1          12         4             50.0
i^=1          12         8             50.0
i^=1          16        16             50.0
i^=1          16         8             50.0
++i           12        16             50.0
++i           12        32             50.0
++i           12         4             50.0
++i           12         8             50.0
++i           16        16             50.0
++i           16        32             50.0
++i           16         8             50.0
i^=1          16        32            100.0
You don't beat i++ unless you go to size 16 with alignment 32.
Looking at the # of cache lines accessed on average, the numbers are
unsurprising. For the 37.5% items, 1.625 cache lines are accessed for
the two probes, 1.5 for the 50% items, and 1.0 for the 100% items.
Looking at the number of cache lines accessed for a single probe,
8-or-better alignment gives 1.0 cache lines accessed for 16-byte
structures, and 1.125 for all other cases (4-byte alignment or 12-byte
structures).
If the "more than 3 probes" case bears optimizing (and I doubt it does),
the for(perturb) loop could be unrolled once, with even iterations using
++i or i^=1 and odd iterations using i = (i << 2) + i + perturb + 1;
so that the same-cache-line property is used as often as possible. Of
course, the code duplication of the rest of the loop body will increase
i-cache pressure a bit.
And I'm surprised if you read this far. Summary: i^=1 is not likely to
win compared to ++i, unless we increase dict size by 33%.
Jeff