On Sat, Jan 15, 2022 at 8:12 PM Tim Peters <tim.peters@gmail.com> wrote:
Something is missing here, but can't guess what without seeing the generated machine code.But I trust Mark will do that.
Welp, there goes my weekend. :-) $ python -m timeit -n 1500000 -s "x = 10**1000" "x//10" 1500000 loops, best of 5: 376 nsec per loop
Which actually makes little sense to me. [...] Under 4 nsec per iteration seems
close to impossibly fast on a 3.8GHz box, given the presence of any
division instruction.
<snip> However, dividing by 10 is not a worst case on this box. Dividing by
100 is over 3x slower:
$ python -m timeit -n 1500000 -s "x = 10**1000" "x//100" 1500000 loops, best of 5: 1.25 usec per loop
Now *that* I certainly wasn't expecting. I don't see the same effect on macOS / Clang, whether compiling with --enable-optimizations or not; this appears to be a GCC innovation. And indeed, as Tim suggested, it turns out that there's no division instruction present in the loop for the division-by-10 case - we're doing division via multiplication by the reciprocal. In Python terms, we're computing `x // 10` as `(x * 0xcccccccccccccccd) >> 67`. Here's the tell-tale snippet of the assembly output from the second compilation (the one that makes use of the generated profile information) of longobject.c at commit 09087b8519316608b85131ee7455b664c00c38d2 <https://github.com/python/cpython/blob/09087b8519316608b85131ee7455b664c00c38d2/Objects/longobject.c> on a Linux box, with GCC 11.2.0. I added a couple of comments, but it's otherwise unaltered .loc 1 1632 36 view .LVU12309 movl %r13d, %r11d salq $2, %rbp cmpl $10, %r13d # compare divisor 'n' with 10, and jne .L2797 # go to the slow version if n != 10 leaq 1(%r10), %r9 # from here on, the divisor is 10 addq %rbp, %r8 .LVL3442: .loc 1 1632 36 view .LVU12310 addq %rbp, %rdi .LVL3443: .loc 1 1632 36 view .LVU12311 .LBE8049: .loc 1 1624 15 view .LVU12312 xorl %r13d, %r13d .LVL3444: .loc 1 1624 15 view .LVU12313 movabsq $-3689348814741910323, %r11 # magic constant 0xcccccccccccccccd for division by 10 and then a few lines later: .loc 1 1630 9 is_stmt 1 view .LVU12316 .loc 1 1631 9 view .LVU12317 .loc 1 1631 39 is_stmt 0 view .LVU12318 movl (%r8,%r10,4), %r14d # move top digit of divisor into the low word of r14 .LVL3446: .loc 1 1632 9 is_stmt 1 view .LVU12319 movq %r14, %rax # set up for division: top digit is now in rax .loc 1 1633 13 is_stmt 0 view .LVU12320 movq %r14, %r13 mulq %r11 # here's the division by 10: multiply by the magic constant shrq $3, %rdx # and divide by 8 (via a shift) and then it all gets a bit repetitive and boring - there's a lot of loop unrolling going on. So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation. TIL ... -- Mark