[Barry Scott and Steve Dower share tips for convincing Visual Studio
to show assembler without recompiling the file]

Thanks, fellows! That mostly ;-) worked. The remaining problem is that breakpoints just didn't work. They showed up "visually", and in the table of set breakpoints, but code went whizzing right by them every time.

I didn't investigate. It's possible, e.g., that the connection between C source and the generated PGO code was so obscure that VS just gave up - or just blew it.

Instead I wrote a Python loop to run a division of interest "forever". That way I hoped I'd be likely to land in the loop of interest by luck when I broke into the process.
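A loop of that kind might look like the sketch below. The function name, the operand size, and the optional `iterations` cap are all assumptions for illustration; the actual loop and values aren't given here. A very large numerator is chosen so that nearly all of the time is spent inside the digit-by-digit division loop.

```python
import itertools

def spin_division(numerator, divisor, iterations=None):
    """Repeat numerator // divisor; loops forever when iterations is None."""
    counter = itertools.count() if iterations is None else range(iterations)
    quotient = None
    for _ in counter:
        # The division of interest: a multi-digit int over a small divisor.
        quotient = numerator // divisor
    return quotient

# To break into it from a debugger, run it unbounded (never returns):
# spin_division(10**100000, 10)
```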

Which worked! So here's the body of the main loop:

00007FFE451D2760 mov eax,dword ptr [rcx-4]
00007FFE451D2763 lea rcx,[rcx-4]
00007FFE451D2767 shl r9,1Eh
00007FFE451D276B or r9,rax
00007FFE451D276E cmp r8,0Ah
00007FFE451D2772 jne long_div+25Bh (07FFE451D27CBh)
00007FFE451D2774 mov rax,rdi
00007FFE451D2777 mul rax,r9
00007FFE451D277A mov rax,rdx
00007FFE451D277D shr rax,3
00007FFE451D2781 mov dword ptr [r10+rcx],eax
00007FFE451D2785 mov eax,eax
00007FFE451D2787 imul rax,r8
00007FFE451D278B sub r9,rax
00007FFE451D278E sub rbx,1
00007FFE451D2792 jns long_div+1F0h (07FFE451D2760h)

And above the loop is this line, which you'll recognize as loading the same scaled reciprocal of 10 as the gcc code Mark posted earlier. The code above moves %rdi into %rax before the mul instruction:

00007FFE451D2747 mov rdi,0CCCCCCCCCCCCCCCDh
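For anyone who wants to check the arithmetic, here is a small Python model of what the mul / shr / imul / sub sequence computes for a 64-bit value. The names `RECIP10` and `divmod10` are made up for this sketch; the constant is ceil(2**67 / 10), so taking the high 64 bits of the product and shifting right by 3 amounts to a total shift of 67 bits.

```python
MASK64 = (1 << 64) - 1
RECIP10 = 0xCCCCCCCCCCCCCCCD  # ceil(2**67 / 10), the value loaded into rdi

def divmod10(n):
    """Model of the reciprocal-multiply sequence for 0 <= n < 2**64."""
    if not 0 <= n <= MASK64:
        raise ValueError("n must fit in 64 bits")
    high = (n * RECIP10) >> 64   # rdx after the 64x64 -> 128 bit mul
    q = high >> 3                # shr rax,3
    r = n - q * 10               # imul rax,r8 then sub r9,rax (r8 == 10)
    return q, r
```

In the loop itself the dividend is (remainder << 30) | digit, which is far below 2**64, so the quotient is exact there.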

Note an odd decision here: the MS code compares the divisor to 10 on _every_ iteration. There are not two, "10 or not 10?", loop bodies. Instead, if the divisor isn't 10, "jne long_div+25Bh" jumps to code not shown here, a few instructions that use hardware division, and then jump back into the tail end of the loop above to finish computing the remainder (etc).

So they not only optimized division by 10, they added a useless test and two branches to every iteration of the loop when we're not dividing by 10 ;-)
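For reference, the shape of that loop can be modeled in Python roughly as follows. This is only a sketch read off the disassembly above: `divrem1_model` and its arguments are hypothetical names, digits are assumed to be 30 bits wide (from the "shl r9,1Eh"), and the comment marks where the compiled code tests for 10 on each pass.

```python
SHIFT = 30  # digit width, matching "shl r9,1Eh"

def divrem1_model(digits, divisor):
    """Divide little-endian base-2**30 digits by a small divisor.

    Returns (quotient_digits, remainder).
    """
    out = [0] * len(digits)
    rem = 0
    for i in reversed(range(len(digits))):
        rem = (rem << SHIFT) | digits[i]
        # Compiled code: "cmp r8,0Ah" here on every iteration, then either
        # the reciprocal multiply (divisor == 10) or a hardware divide.
        q = rem // divisor
        out[i] = q
        rem -= q * divisor
    return out, rem
```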