[Barry Scott and Steve Dower share tips for convincing Visual Studio
 to show assembler without recompiling the file]

Thanks, fellows! That mostly ;-) worked. The remaining problem is that breakpoints just didn't work. They showed up "visually", and in the table of set breakpoints, but code went whizzing right by them every time.

I didn't investigate. It's possible, e.g., that the connection between C source and the generated PGO code was so obscure that VS just gave up - or just blew it.

Instead I wrote a Python loop to run a division of interest "forever". That way I hoped I'd be likely to land in the loop of interest by luck when I broke into the process.
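For anyone who wants to try the same trick, a loop along these lines should do. This is only a sketch: the actual loop isn't shown in this post, and the function name, the `iterations` parameter, and the operand sizes are made up for illustration.

```python
import itertools

def spin_division(numerator, divisor, iterations=None):
    """Repeat numerator // divisor over and over.

    With iterations=None the loop never ends, which is the point: break
    into the process with the debugger and you will most likely land
    inside the division code. Pass a count to get a finite run instead.
    """
    counter = itertools.count() if iterations is None else range(iterations)
    quotient = None
    for _ in counter:
        # The result is thrown away; only the time spent dividing matters.
        quotient = numerator // divisor
    return quotient
```

Calling something like `spin_division(10**5000, 10)` never returns. A numerator with many internal digits keeps the process inside the division loop most of the time, which improves the odds of breaking in at the right spot.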

Which worked! So here's the body of the main loop:

00007FFE451D2760  mov         eax,dword ptr [rcx-4]  
00007FFE451D2763  lea         rcx,[rcx-4]  
00007FFE451D2767  shl         r9,1Eh  
00007FFE451D276B  or          r9,rax  
00007FFE451D276E  cmp         r8,0Ah  
00007FFE451D2772  jne         long_div+25Bh (07FFE451D27CBh)  
00007FFE451D2774  mov         rax,rdi  
00007FFE451D2777  mul         rax,r9  
00007FFE451D277A  mov         rax,rdx  
00007FFE451D277D  shr         rax,3  
00007FFE451D2781  mov         dword ptr [r10+rcx],eax  
00007FFE451D2785  mov         eax,eax  
00007FFE451D2787  imul        rax,r8  
00007FFE451D278B  sub         r9,rax  
00007FFE451D278E  sub         rbx,1  
00007FFE451D2792  jns         long_div+1F0h (07FFE451D2760h)  

And above the loop is this line, which you'll recognize as loading the same scaled reciprocal of 10 as the gcc code Mark posted earlier. The code above moves %rdi into %rax before the mul instruction:

00007FFE451D2747  mov         rdi,0CCCCCCCCCCCCCCCDh
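To see why that constant works, here is the same arithmetic spelled out in Python. It's a sketch of what the mul / shr / imul / sub sequence computes for a 64-bit unsigned dividend; the names `MAGIC`, `MASK64` and `div10` are mine, not anything in the CPython source or the compiler output.

```python
MAGIC = 0xCCCCCCCCCCCCCCCD   # ceil(2**67 / 10), the value loaded into rdi
MASK64 = (1 << 64) - 1

def div10(n):
    """Return (n // 10, n % 10) for 0 <= n < 2**64 without dividing."""
    assert 0 <= n <= MASK64
    high = (n * MAGIC) >> 64    # mul: keep the high 64 bits (rdx)
    quotient = high >> 3        # shr rax,3: total shift is 67 bits
    remainder = n - quotient * 10   # imul by the divisor, then sub
    return quotient, remainder
```

Since 10 * MAGIC is 2**67 + 2, the product overshoots n * 2**67 / 10 by only n / 5, which is too small to bump the quotient up to the next integer for any n below 2**64.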

Note an odd decision here: the MS code compares the divisor to 10 on _every_ iteration. There are not two, "10 or not 10?", loop bodies. Instead, if the divisor isn't 10, "jne long_div+25Bh" jumps to code not shown here, a few instructions that use hardware division, and then jump back into the tail end of the loop above to finish computing the remainder (etc).

So they not only optimized division by 10, they added a useless test and two branches to every iteration of the loop when we're not dividing by 10 ;-)
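The shape of the generated loop, as described above, can be modeled roughly like this. It's a Python sketch of the control flow only, assuming 30-bit digits stored least significant first (matching the "shl r9,1Eh"); `divrem1`, `to_digits` and `from_digits` are illustrative names, not the CPython functions.

```python
SHIFT = 30                       # bits per digit, from "shl r9,1Eh"
MAGIC = 0xCCCCCCCCCCCCCCCD       # scaled reciprocal of 10

def divrem1(digits, divisor):
    """Divide a little-endian list of 30-bit digits by a small divisor."""
    quotient_digits = [0] * len(digits)
    rem = 0
    for i in reversed(range(len(digits))):
        dividend = (rem << SHIFT) | digits[i]
        if divisor == 10:        # tested on every iteration, as in the MS code
            q = ((dividend * MAGIC) >> 64) >> 3
        else:                    # the out-of-line path using hardware division
            q = dividend // divisor
        quotient_digits[i] = q
        rem = dividend - q * divisor   # shared tail: imul, then sub
    return quotient_digits, rem

def to_digits(n):
    """Split a non-negative int into 30-bit digits, least significant first."""
    digits = []
    while n:
        digits.append(n & ((1 << SHIFT) - 1))
        n >>= SHIFT
    return digits

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(d << (SHIFT * i) for i, d in enumerate(digits))
```

Hoisting the `divisor == 10` test out of the loop would give the two separate loop bodies that the compiler chose not to generate.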