As a bystander, this is all fascinating (I had actually anticipated that
the //10 optimization came from PGO).
Does the optimization for //10 actually help in the real world? It would if
people did a lot of manual conversion to decimal, which is easiest
expressed using //10. But presumably for that people mostly end up using
str() or repr(), which has its own custom code,
long_to_decimal_string_internal().
Then again I'm not sure what's *lost* even if this optimization is
pointless -- surely it doesn't slow other divisions down enough to be
measurable.
On Sun, Jan 16, 2022 at 12:35 PM Mark Dickinson
On Sun, Jan 16, 2022 at 12:08 PM Mark Dickinson
wrote: So gcc is anticipating divisions by 10 and introducing special-case divide-by-reciprocal-multiply code for that case, and presumably the profile generated for the PGO backs up this being a common enough case, so we end up with the above code in the final compilation.
Nope, that's not what's happening. This analysis is backwards, and unfairly attributes to GCC the apparently arbitrary choice to optimise division by 10. But it's not GCC's fault; it's ours. What's *actually* happening is that GCC is simply recording values for n used in calls to divrem1 (via the -fprofile-values option, which is implied by -fprofile-generate, which is used as a result of the --enable-optimizations configure script option). It's then noticing that in our profile task (which consists of a selection of Lib/test/test_*.py test files) we most often do divisions by 10, and so it optimizes that case.
To test this hypothesis I added a large number of tests for division by 17 in test_long.py, and then recompiled from scratch (again with --enable-optimizations). Here are the results:
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=10" "x//y"
1000000 loops, best of 5: 1.14 usec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=17" "x//y"
1000000 loops, best of 5: 306 nsec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=1" "x//y"
1000000 loops, best of 5: 1.14 usec per loop
root@341b5fd44b23:/home/cpython# ./python -m timeit -n 1000000 -s "x=10**1000; y=2" "x//y"
1000000 loops, best of 5: 1.15 usec per loop
As expected, division by 17 is now optimised; division by 10 is as slow as division by other small scalars.
-- Mark
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/2MOQCVME... Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...