On Sat, Jan 15, 2022 at 8:12 PM Tim Peters <tim.peters@gmail.com> wrote:
Something is missing here, but can't guess what without seeing the
generated machine code.But I trust Mark will do that.

Welp, there goes my weekend. :-)

 $ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"
1500000 loops, best of 5: 376 nsec per loop

Which actually makes little sense to me. [...] Under 4 nsec per iteration seems
close to impossibly fast on a 3.8GHz box, given the presence of any division instruction.
 
<snip>
 
However, dividing by 10 is not a worst case on this box. Dividing by
100 is over 3x slower:

$ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"
1500000 loops, best of 5: 1.25 usec per loop

Now that I certainly wasn't expecting. I don't see the same effect on macOS / Clang, whether compiling with --enable-optimizations or not; this appears to be a GCC innovation. And indeed, as Tim suggested, it turns out that there's no division instruction present in the loop for the division-by-10 case - we're doing division via multiplication by the reciprocal. In Python terms, we're computing `x // 10` as `(x * 0xcccccccccccccccd) >> 67`. Here's the tell-tale snippet of the assembly output from the second compilation (the one that makes use of the generated profile information) of longobject.c at commit 09087b8519316608b85131ee7455b664c00c38d2 on a Linux box, with GCC 11.2.0. I added a couple of comments, but it's otherwise unaltered  

.loc 1 1632 36 view .LVU12309
movl %r13d, %r11d
salq $2, %rbp
cmpl $10, %r13d # compare divisor 'n' with 10, and
jne .L2797  # go to the slow version if n != 10
leaq 1(%r10), %r9 # from here on, the divisor is 10
addq %rbp, %r8
.LVL3442:
.loc 1 1632 36 view .LVU12310
addq %rbp, %rdi
.LVL3443:
.loc 1 1632 36 view .LVU12311
.LBE8049:
.loc 1 1624 15 view .LVU12312
xorl %r13d, %r13d
.LVL3444:
.loc 1 1624 15 view .LVU12313
movabsq $-3689348814741910323, %r11 # magic constant 0xcccccccccccccccd for division by 10

and then a few lines later:

.loc 1 1630 9 is_stmt 1 view .LVU12316
.loc 1 1631 9 view .LVU12317
.loc 1 1631 39 is_stmt 0 view .LVU12318
movl (%r8,%r10,4), %r14d # move top digit of divisor into the low word of r14
.LVL3446:
.loc 1 1632 9 is_stmt 1 view .LVU12319
movq %r14, %rax # set up for division: top digit is now in rax
.loc 1 1633 13 is_stmt 0 view .LVU12320
movq %r14, %r13
mulq %r11 # here's the division by 10: multiply by the magic constant
shrq $3, %rdx # and divide by 8 (via a shift)

and then it all gets a bit repetitive and boring - there's a lot of loop unrolling going on. 

So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.

TIL ...

-- 
Mark