Bear in mind that what you see by way of CPU speed is based on *sampling*, and the CPU can switch speeds very quickly, far faster than you'd necessarily see in your periodic updates. Also note that if your cooling isn't up to scratch for handling the CPU running permanently at its top normal speed, thermal throttling will cause the system to slow down independently of anything happening on the OS side. That's embedded within the chip and can't be disabled.
FWIW microbenchmarks are inherently unstable and susceptible to jitter on the system side. There are all sorts of things that could be interfering outside the scope of your tests, and because the benchmark is over and done with so quickly, if something does happen it's going to skew the entire benchmark run. If microbenchmarking really is the right thing for your needs, you should look at running enough runs to be able to get a fair idea of realistic performance. Think hundreds of runs, then eliminate particularly fast and/or slow runs from your consideration, and apply whatever other checks you consider appropriate for statistical significance.
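Something along these lines, for instance (just a rough Python sketch, not anything perf.py gives you out of the box; the summarize name, the trim fraction and the sample timings are made up for illustration):

import statistics

def summarize(timings, trim=0.1):
    # timings: per-run durations in seconds, ideally hundreds of them.
    # trim: fraction of runs to discard at each end (10% here, arbitrary).
    runs = sorted(timings)
    k = int(len(runs) * trim)
    kept = runs[k:len(runs) - k] if k else runs
    return statistics.mean(kept), statistics.stdev(kept)

# e.g. with a handful of fake timings:
mean, dev = summarize([0.196, 0.195, 0.197, 0.357, 0.195, 0.196])
print("%.4f +- %.4f s" % (mean, dev))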
I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, and that you're at risk of optimising towards an environment the code won't actually run in, and might even end up pursuing the wrong optimisations.
Paul
On Tue, May 17, 2016 at 11:21:29PM +0200, Victor Stinner wrote:
According to a friend, my CPU model "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" has a "Turbo Mode" which is enabled by default. The CPU tries to use the Turbo Mode whenever possible, but disables it when the CPU is too hot. The change should be visible in the exact CPU frequency (the difference can be as small as a single MHz: 3400 => 3401). I didn't notice such a minor CPU frequency change, but I didn't check carefully.
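If I wanted to check it more carefully, a quick sketch like this (just reading the per-core "cpu MHz" values from /proc/cpuinfo; the sampling interval and number of samples are arbitrary) should show whether the frequency jumps around:

import time

def core_frequencies():
    # Current "cpu MHz" value reported for each core in /proc/cpuinfo.
    freqs = []
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("cpu MHz"):
                freqs.append(float(line.split(":")[1]))
    return freqs

for _ in range(10):   # sample 10 times, one second apart
    print(core_frequencies())
    time.sleep(1)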
Anyway, I disabled the Turbo Mode and Hyperthreading in the EFI. It should avoid the strange performance "drop".
Victor
2016-05-17 16:44 GMT+02:00 Victor Stinner victor.stinner@gmail.com:
Hi,
I'm still having fun with microbenchmarks. I disabled Power States (pstate) of my Intel CPU and forced the frequency to 3.4 GHz. I isolated 2 physical cores out of a total of 4. Timings are very stable *but* sometimes I get an impressive slowdown: like 60% or 80% slower, but only for a short time.
Do you know which CPU feature can explain such a temporary slowdown?
I tried the cpupower & powertop tools to learn more about internal CPU states, but I didn't see anything obvious. I also noticed that powertop has a major side effect: it changes the speed of my CPU cores! Since the CPU cores used to run benchmarks are isolated, powertop switches them to a low speed (like 1.6 GHz, half speed) while benchmarks are running, probably because the kernel doesn't "see" the benchmark processes.
My CPU model is: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
I'm using "userspace" scaling governor for isolated CPU cores, but "ondemand" for other CPU cores.
I disabled pstate (kernel parameter: intel_pstate=disable), the CPU scaling driver is "acpi-cpufreq".
CPUs 2,3,6,7 are isolated.
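For reference, this is roughly how I double-check the driver and governor per core (just a sketch reading the standard cpufreq sysfs files; the expected values in the comments match my setup):

def read(path):
    with open(path) as f:
        return f.read().strip()

for cpu in range(8):
    base = "/sys/devices/system/cpu/cpu%d/cpufreq" % cpu
    print(cpu,
          read(base + "/scaling_driver"),     # expected: acpi-cpufreq
          read(base + "/scaling_governor"))   # userspace on 2,3,6,7, ondemand elsewhere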
In the following examples, the same microbenchmark takes ~196 ms on all cores, except for core 3 in the first example.
Example 1:
$ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py -n 1 --timer perf_counter; done
=== CPU 0 ===
0.19619656700160704
=== CPU 1 ===
0.19547197800056892
=== CPU 2 ===
0.19512042699716403
=== CPU 3 ===
0.35738898099953076
=== CPU 4 ===
0.19744606299718725
=== CPU 5 ===
0.195480646998476
=== CPU 6 ===
0.19495172200186062
=== CPU 7 ===
0.19495161599843414
Example 2:
$ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0 taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py -n 1 --timer perf_counter; done
=== CPU 0 ===
0.19725238799946965
=== CPU 1 ===
0.19552089699936914
=== CPU 2 ===
0.19495758999983082
=== CPU 3 ===
0.19517506799820694
=== CPU 4 ===
0.1963375539999106
=== CPU 5 ===
0.19575440099652042
=== CPU 6 ===
0.19582506000006106
=== CPU 7 ===
0.19503543600148987
If I repeat the same test, timings are always ~196 ms on all cores.
It looks like some cores decide to sleep.
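If that's what's happening, something like this sketch (dumping the per-core idle-state counters from sysfs; the directory layout is the standard Linux cpuidle interface and the core list is my isolated cores) should show which C-states the isolated cores are entering:

import glob

for cpu in (2, 3, 6, 7):
    pattern = "/sys/devices/system/cpu/cpu%d/cpuidle/state*" % cpu
    for state in sorted(glob.glob(pattern)):
        with open(state + "/name") as f:
            name = f.read().strip()
        with open(state + "/usage") as f:
            usage = f.read().strip()
        print("cpu%d %s: entered %s times" % (cpu, name, usage))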
Victor
On Wed, 18 May 2016 21:05:11 -0000, Paul Graydon paul@paulgraydon.co.uk wrote:
I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, and that you're at risk of optimising towards an environment the code won't actually run in, and might even end up pursuing the wrong optimisations.
My understanding is that Victor isn't using this to guide optimization, but rather to have a quick-as-possible way to find out that he screwed up when he made a code change. I'm sure he's using much longer benchmark runs for actually looking at the performance impact of the complete changeset.
--David
FYI I'm running the CPython Benchmark Suite with:
taskset -c 1,3 python3 -u perf.py --rigorous ../ref_python/pgo/python ../fastcall/pgo/python -b all
I was asked to use --rigorous and -b all when I worked on other patches, like: https://bugs.python.org/issue21955#msg259431
2016-05-19 0:04 GMT+02:00 R. David Murray rdmurray@bitdance.com:
On Wed, 18 May 2016 21:05:11 -0000, Paul Graydon paul@paulgraydon.co.uk wrote:
I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, and that you're at risk of optimising towards an environment the code won't actually run in, and might even end up pursuing the wrong optimisations.
My understanding is that Victor isn't using this to guide optimization, but rather to have a quick-as-possible way to find out that he screwed up when he made a code change. I'm sure he's using much longer benchmark runs for actually looking at the performance impact of the complete changeset.
Right, I don't use the benchmark suite to choose which parts of the code should be optimized, but only to ensure that my optimizations make Python faster, as expected :-)
But I understood what Paul wrote: he says that making a random parameter constant (like the randomized hash function) can lead to a wrong conclusion about a patch. Depending on the chosen fixed value, the benchmark can say that the patch makes Python faster or slower. Well, at least in corner cases, especially in microbenchmarks like call_simple.
Victor
2016-05-18 23:05 GMT+02:00 Paul Graydon paul@paulgraydon.co.uk:
Bear in mind that what you see by way of CPU speed is based on *sampling*, and the CPU can switch speeds very quickly, far faster than you'd necessarily see in your periodic updates. Also note that if your cooling isn't up to scratch for handling the CPU running permanently at its top normal speed, thermal throttling will cause the system to slow down independently of anything happening on the OS side. That's embedded within the chip and can't be disabled.
I checked the temperature of my CPU cores using the "sensors" command and it was somewhere around ~50°C, which doesn't seem "too hot" to me. A better bet is that I was close to the temperature at which the CPU switches Turbo Mode on or off.
I disabled Turbo Mode and Hyperthreading on my CPU and I haven't reproduced the random slowdown since.
I had also misunderstood how Turbo Mode works. By default, the CPU uses Turbo Mode, but disables it automatically if it gets too hot. I had expected that the CPU wouldn't use Turbo Mode by default, but would start using it after a few seconds of high CPU usage.
It looks like the performance also depends on the number of cores currently in use: https://en.wikipedia.org/wiki/Intel_Turbo_Boost#Example
FWIW microbenchmarks are inherently unstable and susceptible to jitter on the system side.
Using CPU isolation helps a lot to reduce the noise coming from the "system".
If microbenchmarking really is the right thing for your needs, (...)
Someone asked me to check the performance of my patches using perf.py, so I'm using it. The accuracy of some specific benchmarks in this suite is still an open question ;-)
... you should look at running enough runs to be able to get a fair idea of realistic performance.
Right, this idea was already discussed in other threads and is already implemented in the PyPy flavor of perf.py. I also patched my local perf.py to do that.
I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, and that you're at risk of optimising towards an environment the code won't actually run in, and might even end up pursuing the wrong optimisations.
Yeah, that's an excellent remark :-) It's not the first time that I've read it. I think that it's ok to use CPU isolation and tune CPU options (e.g. disable Turbo Mode) to reduce the noise. Other parameters, like disabling hash randomization or disabling ASLR, are more of an open question.
It seems to me that disabling randomization (hash function, ASLR) introduces a risk of reaching an invalid conclusion (that the patch makes Python faster / slower). But I have read this advice many times, and perf.py currently explicitly disables hash randomization.
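One way to estimate that effect, instead of pinning PYTHONHASHSEED to a single value, might be to run the benchmark under several seeds and look at the spread. A rough sketch (the command and the paths mirror the examples above, and the number of seeds is arbitrary):

import os
import subprocess

cmd = ["python", "performance/bm_call_simple.py",
       "-n", "1", "--timer", "perf_counter"]
for seed in range(5):   # 5 different hash seeds, purely arbitrary
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(cmd, env=env)
    print("seed %d: %s" % (seed, out.decode().strip()))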
The most common trend in benchmarking is to disable all sources of noise and only look at the minimum (smallest timing). In my experience (over the last few weeks), it just doesn't work, at least for microbenchmarks.
Victor