On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson <dickinsm@gmail.com> wrote:
So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.

Nope, that's not what's happening. This analysis is backwards, and unfairly attributes to GCC the apparently arbitrary choice to optimise division by 10. But it's not GCC's fault; it's ours. What's actually happening is that GCC is simply recording values for n used in calls to divrem1 (via the -fprofile-values option, which is implied by -fprofile-generate, which is used as a result of the --enable-optimizations configure script option). It's then noticing that in our profile task (which consists of a selection of Lib/test/test_*.py test files) we most often do divisions by 10, and so it optimizes that case.

To test this hypothesis I added a large number of tests for division by 17 in test_long.py, and then recompiled from scratch (again with --enable-optimizations). Here are the results:

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=10" "x//y"

1000000 loops, best of 5: 1.14 usec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=17" "x//y"

1000000 loops, best of 5: 306 nsec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=1" "x//y"

1000000 loops, best of 5: 1.14 usec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=2" "x//y"

1000000 loops, best of 5: 1.15 usec per loop


As expected, division by 17 is now optimised; division by 10 is as slow as division by other small scalars.

-- 
Mark