As a bystander, this is all fascinating (I had actually anticipated that the //10 optimization came from PGO).

Does the optimization for //10 actually help in the real world? It would if people did a lot of manual conversion to decimal, which is most easily expressed using //10. But presumably for that people mostly end up using str() or repr(), which have their own custom code, long_to_decimal_string_internal().
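For concreteness, here is a minimal sketch of the kind of manual decimal conversion I have in mind. The function name to_decimal_manual is made up for illustration; each loop iteration does one big-int division by 10, which is the case under discussion.

```python
def to_decimal_manual(n: int) -> str:
    """Convert an int to its decimal string by repeated division by 10.

    Hypothetical helper for illustration only; str(n) does the same job
    through its own dedicated C code path.
    """
    if n == 0:
        return "0"
    sign = "-" if n < 0 else ""
    n = abs(n)
    digits = []
    while n:
        # divmod(n, 10) performs the n // 10 division and also yields the digit.
        n, d = divmod(n, 10)
        digits.append(chr(ord("0") + d))
    # Digits were produced least-significant first, so reverse them.
    return sign + "".join(reversed(digits))
```

Code like this would benefit from a fast //10, while a plain str(n) call would not go through it at all.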

Then again I'm not sure what's *lost* even if this optimization is pointless -- surely it doesn't slow other divisions down enough to be measurable.

On Sun, Jan 16, 2022 at 12:35 PM Mark Dickinson <dickinsm@gmail.com> wrote:

> On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson <dickinsm@gmail.com> wrote:
> So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.

Nope, that's not what's happening. This analysis is backwards, and unfairly attributes to GCC the apparently arbitrary choice to optimise division by 10. But it's not GCC's fault; it's ours. What's actually happening is that GCC is simply recording values for n used in calls to divrem1 (via the -fprofile-values option, which is implied by -fprofile-generate, which is used as a result of the --enable-optimizations configure script option). It's then noticing that in our profile task (which consists of a selection of Lib/test/test_*.py test files) we most often do divisions by 10, and so it optimizes that case.
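To make the effect concrete, here is a rough Python model of what a value-specialised division loop amounts to. It is only a sketch under stated assumptions: 30-bit digits (typical for 64-bit builds), and the names divrem1_model, div10_by_reciprocal, to_digits and from_digits are invented for illustration. The real code is C, and the exact shape of what GCC emits will differ; the point is just that once the divisor is known to be the constant 10, the per-digit division can be replaced by a multiply and a shift.

```python
SHIFT = 30                      # bits per digit (assumption: 30-bit digits)
MASK = (1 << SHIFT) - 1
RECIP_10 = 0xCCCCCCCCCCCCCCCD   # ceil(2**67 / 10)


def div10_by_reciprocal(x):
    """Return x // 10 for 0 <= x < 2**64 using only a multiply and a shift."""
    return (x * RECIP_10) >> 67


def to_digits(x):
    """Split a non-negative int into little-endian base-2**SHIFT digits."""
    digits = []
    while x:
        digits.append(x & MASK)
        x >>= SHIFT
    return digits or [0]


def from_digits(digits):
    """Inverse of to_digits."""
    x = 0
    for d in reversed(digits):
        x = (x << SHIFT) | d
    return x


def divrem1_model(digits, n):
    """Divide a multi-digit number by a single digit n (0 < n < 2**SHIFT).

    Returns (quotient_digits, remainder). The n == 10 branch stands in for
    the specialised path that a value profile can lead the compiler to add.
    """
    quotient = [0] * len(digits)
    rem = 0
    for i in reversed(range(len(digits))):
        # rem < n here, so the two-digit value below stays under 2**60.
        rem = (rem << SHIFT) | digits[i]
        if n == 10:
            hi = div10_by_reciprocal(rem)   # constant divisor: no real division
        else:
            hi = rem // n                   # generic path: hardware division
        quotient[i] = hi
        rem -= hi * n
    return quotient, rem


x = 10**1000
for n in (10, 17):
    q, r = divrem1_model(to_digits(x), n)
    assert (from_digits(q), r) == divmod(x, n)
```

In Python the two branches cost about the same, of course; in C the multiply-and-shift path avoids a comparatively slow division instruction per digit, which is where the speed difference comes from.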

To test this hypothesis I added a large number of tests for division by 17 in test_long.py, and then recompiled from scratch (again with --enable-optimizations). Here are the results:

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=10" "x//y"
1000000 loops, best of 5: 1.14 usec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=17" "x//y"
1000000 loops, best of 5: 306 nsec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=1" "x//y"
1000000 loops, best of 5: 1.14 usec per loop

root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=2" "x//y"
1000000 loops, best of 5: 1.15 usec per loop

As expected, division by 17 is now optimised; division by 10 is as slow as division by other small scalars.
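For anyone who wants to repeat the comparison in one go, something along these lines should work with the standard timeit module. This is a sketch, not the script used for the numbers above: the helper name time_floor_division and the default loop counts are my own choices, and absolute timings will depend on the build and the machine.

```python
import timeit


def time_floor_division(divisor, number=100_000, repeat=5):
    """Return the best observed time in seconds for one x // y, x = 10**1000."""
    timer = timeit.Timer("x // y", setup=f"x = 10**1000; y = {divisor}")
    # Take the minimum over several runs, as the timeit command line does.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number


if __name__ == "__main__":
    for divisor in (10, 17, 1, 2):
        seconds = time_floor_division(divisor)
        print(f"x // {divisor}: {seconds * 1e9:.0f} nsec per loop")
```

Running it under the interpreter built with --enable-optimizations, before and after changing the profile task, should show which divisor (if any) has been singled out.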

--
Mark

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/2MOQCVMEQBV7PATT47GUYHS42QIJHTRK/