On Sat, Jan 15, 2022 at 8:12 PM Tim Peters <tim.peters@gmail.com> wrote:

Something is missing here, but can't guess what without seeing the

generated machine code.But I trust Mark will do that.

Welp, there goes my weekend. :-)

$ python -m timeit -n 1500000 -s "x = 10**1000" "x//10"

1500000 loops, best of 5: 376 nsec per loop

Which actually makes little sense to me. [...] Under 4 nsec per iteration seems

close to impossibly fast on a 3.8GHz box, given the presence of any division instruction.

<snip>

However, dividing by 10 is not a worst case on this box. Dividing by

100 is over 3x slower:

$ python -m timeit -n 1500000 -s "x = 10**1000" "x//100"

1500000 loops, best of 5: 1.25 usec per loop

Now *that* I certainly wasn't expecting. I don't see the same effect on macOS / Clang, whether compiling with --enable-optimizations or not; this appears to be a GCC innovation. And indeed, as Tim suggested, it turns out that there's no division instruction present in the loop for the division-by-10 case - we're doing division via multiplication by the reciprocal. In Python terms, we're computing `x // 10` as `(x * 0xcccccccccccccccd) >> 67`. Here's the tell-tale snippet of the assembly output from the second compilation (the one that makes use of the generated profile information) of longobject.c at commit 09087b8519316608b85131ee7455b664c00c38d2 on a Linux box, with GCC 11.2.0. I added a couple of comments, but it's otherwise unaltered

.loc 1 1632 36 view .LVU12309

movl %r13d, %r11d

salq $2, %rbp

cmpl $10, %r13d # compare divisor 'n' with 10, and

jne .L2797 # go to the slow version if n != 10

leaq 1(%r10), %r9 # from here on, the divisor is 10

addq %rbp, %r8

.LVL3442:

.loc 1 1632 36 view .LVU12310

addq %rbp, %rdi

.LVL3443:

.loc 1 1632 36 view .LVU12311

.LBE8049:

.loc 1 1624 15 view .LVU12312

xorl %r13d, %r13d

.LVL3444:

.loc 1 1624 15 view .LVU12313

movabsq $-3689348814741910323, %r11 # magic constant 0xcccccccccccccccd for division by 10

.loc 1 1630 9 is_stmt 1 view .LVU12316

.loc 1 1631 9 view .LVU12317

.loc 1 1631 39 is_stmt 0 view .LVU12318

movl (%r8,%r10,4), %r14d # move top digit of divisor into the low word of r14

.LVL3446:

.loc 1 1632 9 is_stmt 1 view .LVU12319

movq %r14, %rax # set up for division: top digit is now in rax

.loc 1 1633 13 is_stmt 0 view .LVU12320

movq %r14, %r13

mulq %r11 # here's the division by 10: multiply by the magic constant

shrq $3, %rdx # and divide by 8 (via a shift)

and then it all gets a bit repetitive and boring - there's a lot of loop unrolling going on.

So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.

TIL ...

--

Mark