I did some more tests concentrating on GCC, partly based on the feedback I
got, results at

Executive summary: Python needs to be compiled with -O2 or -O3. Compiling
without any optimization level roughly doubles execution time with GCC
4.2.1. Using just -O1 is still ~15% slower than using -O2.
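
To check which optimization level an existing interpreter was built with,
something like the following may help. It is only a sketch: it assumes a
Unix-style build where sysconfig exposes the CFLAGS and OPT variables, and
the helper names optimization_level and build_flags are made up for
illustration.

```python
import re
import sysconfig


def optimization_level(flags):
    """Return the last -O flag in a compiler flag string, or None if absent."""
    # GCC uses the last -O option given, so the final match is the effective one.
    matches = re.findall(r"(?<!\S)-O(?:[0-3sg]|fast)?(?!\S)", flags or "")
    return matches[-1] if matches else None


def build_flags():
    """Join the CFLAGS and OPT build variables into one string."""
    # Assumption: these variables exist on Unix-style builds; either may be
    # None on other platforms, in which case it is treated as empty.
    parts = [sysconfig.get_config_var(name) or "" for name in ("CFLAGS", "OPT")]
    return " ".join(parts)


if __name__ == "__main__":
    print("optimization flag:", optimization_level(build_flags()) or "none found")
```

If this reports no flag, or -O0 or -O1, the interpreter is probably one of
the slow builds described above.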

Using -mtune=native -march=native can shave off 0.1 to 0.2 seconds, but
otherwise I did not find much difference from using march or mfpmath.
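
Differences of a few tenths of a second are easy to lose in run-to-run
noise, so it may be worth timing each build several times and comparing the
best runs. A minimal sketch, not the benchmark used for the numbers above:
workload is a hypothetical stand-in and best_time is a name made up here.

```python
import timeit


def workload():
    # Hypothetical stand-in; replace with the benchmark you actually care about.
    total = 0
    for i in range(100000):
        total += i * i
    return total


def best_time(func, repeat=5, number=10):
    """Return the fastest total time in seconds for `number` calls of func."""
    # Taking the minimum over several repeats reduces the influence of
    # other activity on the machine.
    return min(timeit.repeat(func, repeat=repeat, number=number))


if __name__ == "__main__":
    print("best of 5: %.3f s" % best_time(workload))
```

Run the same script under each interpreter build and compare the printed
times.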

Profile-guided optimization did not help much. As might be expected, it
gave about the same improvement as the mtune/march combination.

