[New-bugs-announce] [issue23359] Speed-up set_lookkey()

Sat Jan 31 22:37:58 CET 2015

New submission from Raymond Hettinger:

This patch applies three techniques to tighten-up the generated code for the lookkey() for sets.

I'm not sure I want to do this because it expands the size of the code quite a bit (much like the previously existing lookkey specializations did).  As such, it offends my sense of beauty and neatness.

That said, it greatly tightens the inner search loop and in the handful of timings I've done so far, it provides a good speed-up for smaller sets (ones that fit in L1 and L2 cache) and minor improvements for bigger sets (where running time is dominated by cost of fetching a cache-line with a key/hash pair).

The three transformations are:

1.)  Make two variants of lookkey, one with an early-out for a dummy entry and one that treats a dummy entry as an active entry.   This eliminates the freeslot tracking in both cases.  

In the first case (early-out), it frees one register (the one for the freeslot), it simplifies the inner-loop test (no need to see if freeslot is NULL), it provides an early-out (return on dummy rather than continuing to search), and it simplifies the return code (just return entry, no need to check the freeslot).

In the second case (dummies treated as active), it frees two registers (one for freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop test entirely.

That eliminated inner inner-loop code used to look like this in gcc-4.9:

    cmpq    %r15, %rbp      entry->key != dummny
    jne L375                                   |
    testq   %r13, %r13      freelslot == NULL  |
    cmove   %rbx, %r13                         |
L375:                                      <---|

And the eliminated inner loop code was even worse with clang:

    leaq    __dummy_struct(%rip), %rax
    cmpq    %rax, %r14                entry->key==dummy
    sete    %al
    testq   %rbx, %rbx                freeslot==NULL
    sete    %cl
    testb   %cl, %al
    cmovneq %r13, %rbx                freeslot ?= entry

2.) Avoid the &mask step for computing indices when not needed. Using the same technique that is used in set_insert_clean, we test to see if there is a possible wrap-around during the linear probes.  If not, we can skip the lengthy address calculation for table[(i+j)&mask] and instead use entry++.

This saves the sequentially dependent steps of add j, the bitwise-and with the mask, a shift left by four to scale the index, and the add to the table start address.  It replaces those steps with a single entry++ which codes as an add $16.

3).  In the linear_probe inner-loop for the ignore dummy case, replace (entry->key==NULL) with (entry->hash==0 && entry->key==NULL).   While that looks like a step backward adding an extra test, that second test is essentially free (a register compare and branch which is nearly 100% predictable).   The benefit is that the inner-loop only needs to fetch the entry->hash value and doesn't need to fetch entry->key.   This doesn't sound like much but it cuts the number of memory accesses per loop in half.  

Individually, three transformations don't seem like much, but they take already tight code and cut the work more than in-half (for the L1 and L2 case).   Collectively, it frees a couple of registers for better register allocation, it reduces the number of compares inside the loop, in substantially reduces the time for the entry address calculation, and in the ignore-dummy case cuts the number of memory accesses per loop in half.

The resulting code is astonishingly tight compared to the original and looks almost as tight as the code in set_insert_clean().

--- inner-loop of set_lookkey_dummy_allowed() ---
L393:
    cmpq    %r8, (%rbx)        entry->key == dummy
    je  L453
    cmpq    %r12, %rbx         entry == limit
    movq    %rbx, %r14
    je  L506
L412:
    movq    16(%r14), %rbp
    leaq    16(%r14), %rbx
    testq   %rbp, %rbp         entry->key NULL
    je  L449
    cmpq    8(%rbx), %r15      entry->hash == hash
    jne L393

--- inner-loop of set_lookkey_dummy_ignored() ---
L846:
    cmpq    %r13, %rbx         entry == limit
    je  L942
    addq    $16, %rbx          entry++
    movq    8(%rbx), %rax      entry->hash == NULL
    testq   %rax, %rax
    jne L845
    cmpq    $0, (%rbx)         entry->key == NULL (not in the inner loop)
    je  L905
L845:
    cmpq    %r12, %rax         entry->hash == tgt
    jne L846

Before anything can go forward, more timings are needed and I need to come to terms with the code expansion back to size it was before the old lookkey specializations where removed.  For now, I just want to document the work that has been done.

Anyway, here it is if someone wants to experiment with it.  Expect benefits in the cases where the inner-loop matters (small to medium sized sets that have collisions -- giant sets benefit as well but their lookup speed is dominated by the cost to fetch a single random cache line).

----------
files: hoisted_and_specialized.diff
keywords: patch
messages: 235131
nosy: rhettinger
priority: low
severity: normal
stage: patch review
status: open
title: Speed-up set_lookkey()
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file37945/hoisted_and_specialized.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue23359>
_______________________________________