On Wed, Apr 30, 2003 at 12:32:23AM -0400, Tim Peters wrote:
> If someone wants to experiment with that in lookdict_string(), stick a new
>     ++i;
> before the for loop, and move the existing
>     i = (i << 2) + i + perturb + 1;
> to the bottom of that loop.  Likewise for lookdict().
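(For anyone following along, here is a small Python model of what that reordering does to the probe sequence -- not the C patch itself.  The recurrence and PERTURB_SHIFT = 5 are taken from dictobject.c; the function name and the linear_first flag are mine.)

```python
PERTURB_SHIFT = 5  # value used by CPython's dictobject.c

def probe_slots(h, mask, n, linear_first=False):
    """Return the first n collision-probe slots for hash h in a table
    of mask+1 slots (the initial slot h & mask is not included).

    linear_first=False follows lookdict()'s existing recurrence;
    linear_first=True models the suggested change: check the very next
    slot first, then continue with the perturbed recurrence.
    """
    slots = []
    i = h & mask
    perturb = h
    if linear_first:
        i += 1                            # the proposed ++i before the loop
        slots.append(i & mask)
    while len(slots) < n:
        i = (i << 2) + i + perturb + 1    # i.e. i = 5*i + perturb + 1
        perturb >>= PERTURB_SHIFT
        slots.append(i & mask)
    return slots
```

For example, probe_slots(3, 7, 3) gives [3, 0, 1], while probe_slots(3, 7, 3, linear_first=True) gives [4, 0, 1] -- the first collision probe becomes the adjacent slot, which is what makes the cache-line argument below interesting.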
You might also investigate making PyDictEntry a power-of-two bytes big (it's currently 12 bytes) so that the entries align nicely in the cache, and then use

    i ^= 1;

instead of ++i; so that the second key checked is always in the same (32-byte or bigger) cache line.  Of course, increasing the size of PyDictEntry would also increase the size of all dicts by 33%, so the speed payoff would have to be big.

It's also not obvious that ma_smalltable will be 32-byte aligned (since no special effort was made, it's unlikely to be).  If it's not, then this optimization would still not pay (compared to ++i) for <= MINSIZE dictionaries (which are the important case, aren't they?).

A little program indicates that if the table has 8-byte or better alignment, the xor approach gives same-cache-line results more frequently than the increment approach, even with a 12-byte PyDictEntry.  This doesn't quite make sense to me.  It also indicates that if the alignment is less than 32 bytes but the entry is 16 bytes, xor is a loss, which does make sense.  The results (for a 32-byte cache line):

    algorithm  sizeof()  alignment  % in same cache line
    i^=1       12        4           62.5
    i^=1       12        8           75.0
    i^=1       12        16          75.0
    i^=1       12        32          75.0
    i^=1       16        4           50.0
    i^=1       16        8           50.0
    i^=1       16        16          50.0
    i^=1       16        32         100.0
    ++i        12        4           62.5
    ++i        12        8           62.5
    ++i        12        16          62.5
    ++i        12        32          62.5
    ++i        16        4           50.0
    ++i        16        8           50.0
    ++i        16        16          50.0
    ++i        16        32          50.0

So using i^=1 and adding 4 bytes to each dict (if necessary) to get 8-byte alignment of ma_smalltable would give only a 12.5% increase in the hit rate of the second probe compared to ++i.  Ouch.

When I take into account that each probe accesses me_key (not just me_hash), the results change:

    algorithm  sizeof()  alignment  % in same cache line
    i^=1       16        4           37.5
    ++i        16        4           37.5
    i^=1       12        16          50.0
    i^=1       12        32          50.0
    i^=1       12        4           50.0
    i^=1       12        8           50.0
    i^=1       16        16          50.0
    i^=1       16        8           50.0
    ++i        12        16          50.0
    ++i        12        32          50.0
    ++i        12        4           50.0
    ++i        12        8           50.0
    ++i        16        16          50.0
    ++i        16        32          50.0
    ++i        16        8           50.0
    i^=1       16        32         100.0

You don't beat ++i unless you go to size 16 with alignment 32.
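(The "little program" wasn't posted, so here is a plausible reconstruction in Python.  Details are my guesses -- in particular it ignores wrap-around at the end of the 8-slot small table -- but it reproduces several of the percentages quoted above.)

```python
LINE = 32  # assumed cache-line size in bytes

def same_line_pct(entry_size, align, probe, nbytes=4, nslots=8):
    """Percentage of probe pairs whose accessed bytes all fall in one
    cache line, averaged over every table base offset that is a
    multiple of `align` (mod LINE) and every first slot.

    probe maps a slot index to the second slot tried (i^1 or i+1);
    nbytes=4 models touching me_hash only, nbytes=8 models touching
    me_hash plus me_key.  Table wrap-around is deliberately ignored.
    """
    hits = trials = 0
    for base in range(0, LINE, align):
        for i in range(nslots):
            j = probe(i)
            a = base + i * entry_size   # bytes touched by the first probe
            b = base + j * entry_size   # bytes touched by the second probe
            lines = {off // LINE for off in range(a, a + nbytes)}
            lines |= {off // LINE for off in range(b, b + nbytes)}
            hits += (len(lines) == 1)
            trials += 1
    return 100.0 * hits / trials
```

For instance, same_line_pct(16, 32, lambda i: i ^ 1) gives 100.0 and same_line_pct(12, 8, lambda i: i + 1) gives 62.5, matching the corresponding rows of the first table.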
Looking at the number of cache lines accessed on average, the numbers are unsurprising: for the 37.5% items, 1.625 cache lines are accessed over the two probes; 1.5 for the 50% items; and 1.0 for the 100% items.  Looking at the number of cache lines accessed for a single probe, 8-or-better alignment gives 1.0 cache lines for 16-byte structures, and 1.125 for all other cases (4-byte alignment or a 12-byte structure).

If the "more than 3 probes" case bears optimizing (and I doubt it does), the for(perturb) loop could be unrolled once, with even iterations using ++i or i^=1 and odd iterations using

    i = (i << 2) + i + perturb + 1;

so that the same-cache-line property is exploited as often as possible.  Of course, duplicating the rest of the loop body will increase i-cache pressure a bit.

And I'm surprised if you read this far.

Summary: i^=1 is not likely to win compared to ++i unless we increase dict size by 33%.

Jeff