Does the optimization for //10 actually help in the real world? [...]
Yep, I don't know. If 10 is not the most common small divisor in real world code, it must at least rank in the top five. I might hazard a guess that division by 2 would be more common, but I've no idea how one would go about establishing that.
All of the int constants relating to time and date calculations show up frequently as well. But I'd assume -fprofile-values isn't likely to pick many to specialize on to avoid adding branches so maybe 10 is ironically it. --enable-optimizations with clang doesn't trigger value specialization (I'm pretty sure they support the concept, but I've never looked at how).
The reason that the divisor of 10 is turning up from the PGO isn't a particularly convincing one - it looks as though it's a result of our testing the builtin int-to-decimal-string conversion by comparing with an obviously-correct repeated-division-by-10 algorithm.
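For reference, the kind of check being described looks roughly like this. It is a minimal sketch that uses a fixed-width uint64_t as a stand-in for a Python int; the names u64_to_decimal_ref and decimal_conversions_agree are hypothetical and not taken from the test suite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reference conversion: peel off decimal digits with repeated % 10 and / 10.
   Writes a NUL-terminated string into buf and returns its length.
   buf must hold at least 21 bytes (20 digits for UINT64_MAX plus the NUL). */
static size_t
u64_to_decimal_ref(uint64_t x, char *buf, size_t bufsize)
{
    char tmp[20];
    size_t n = 0;

    assert(bufsize >= 21);
    do {
        tmp[n++] = (char)('0' + x % 10);   /* least significant digit first */
        x /= 10;
    } while (x != 0);
    for (size_t i = 0; i < n; i++) {
        buf[i] = tmp[n - 1 - i];           /* reverse into output order */
    }
    buf[n] = '\0';
    return n;
}

/* Returns 1 if the library conversion agrees with the reference one. */
static int
decimal_conversions_agree(uint64_t x)
{
    char expected[21], actual[32];

    u64_to_decimal_ref(x, expected, sizeof expected);
    snprintf(actual, sizeof actual, "%llu", (unsigned long long)x);
    return strcmp(expected, actual) == 0;
}
```

Every digit produced by the reference path costs one division by 10, which is how a test like this can dominate the divisor profile that PGO sees.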
Then again I'm not sure what's *lost* even if this optimization is pointless -- surely it doesn't slow other divisions down enough to be measurable.
Agreed. That at least is testable. I can run some timings (but not tonight).
BTW, I am able to convince clang 11 and higher to produce a 64:32 divide instruction with a modified version of the code. Basically just taking your assembly divl variant as an example and writing that explicitly as the operations in C:
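Roughly this shape, as a minimal sketch rather than the actual patch. It assumes 30-bit digits stored least significant first in uint32_t, and the names divrem1_sketch, SKETCH_SHIFT and SKETCH_MASK are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SHIFT 30   /* assumed digit width in bits */
#define SKETCH_MASK ((UINT32_C(1) << SKETCH_SHIFT) - 1)

/* Divide the multi-digit number pin[0..size-1] by the single digit n.
   Stores the quotient digits in pout and returns the remainder. */
static uint32_t
divrem1_sketch(uint32_t *pout, const uint32_t *pin, ptrdiff_t size, uint32_t n)
{
    uint32_t remainder = 0;

    assert(n > 0 && n <= SKETCH_MASK);
    while (--size >= 0) {
        /* 64-bit dividend: previous remainder in the high part,
           current digit in the low part. */
        uint64_t dividend =
            ((uint64_t)remainder << SKETCH_SHIFT) | pin[size];
        /* remainder < n, so the quotient always fits in one digit. */
        uint32_t quotient = (uint32_t)(dividend / n);
        remainder = (uint32_t)(dividend % n);
        pout[size] = quotient;
    }
    return remainder;
}
```

The quotient and remainder are written as a plain / and % on the same operands so the compiler is free to fold them into one divide. Whether a given compiler and version actually does so is something to confirm in the generated assembly.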
The modified code is 20% faster in microbenchmarks with x//1, x//17, or other non-specialized divisors, with a similar speedup even in --enable-optimizations builds, on both gcc9 and clang13.
The compilers seem happier optimizing that code.
-gps