speed@python.org

May 2016

  • 8 participants
  • 19 discussions
Re: [Speed] Performance comparison of regular expression engines
by Serhiy Storchaka 11 Jun '16

On 06.03.16 11:30, Maciej Fijalkowski wrote:
> this is really difficult to read, can you tell me which column am I looking at?

The first column is the searched pattern. The second column is the number of found matches (for control, it should be the same with all engines and versions). The third column, under the "re" header, is the time in milliseconds. The column under the "str.find" header is the time of searching without using regular expressions. PyPy 2.2 is usually significantly faster than CPython 2.7, except when searching a plain string with a regular expression. But thanks to the Flexible String Representation, searching a plain string with and without a regular expression is faster on CPython 3.6.
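A minimal sketch of the kind of comparison being discussed, timing a regular-expression search against plain str.find on the same text; the pattern, test string, and repeat count below are illustrative assumptions, not the benchmark that was actually run:
---
# Hedged sketch: compare re search vs. str.find on an assumed pattern/text.
import re
import timeit

text = "a" * 100_000 + "spam" + "a" * 100_000   # assumed test data
pattern = "spam"
compiled = re.compile(pattern)

re_ms = timeit.timeit(lambda: compiled.search(text), number=1000) * 1000
find_ms = timeit.timeit(lambda: text.find(pattern), number=1000) * 1000

print(f"re.search: {re_ms:.1f} ms / 1000 searches")
print(f"str.find:  {find_ms:.1f} ms / 1000 searches")
---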
Re: [Speed] External sources of noise changing call_simple "performance"
by Antoine Pitrou 31 May '16

On Tue, 31 May 2016 14:41:55 +0200
Victor Stinner <victor.stinner(a)gmail.com> wrote:
> 2016-05-30 10:14 GMT+02:00 Antoine Pitrou <solipsis(a)pitrou.net>:
> >> I'm still (!) investigating the reasons why the benchmark call_simple
> >> (ok, let's be honest: the *micro*benchmark) gets different results for
> >> unknown reasons.
> >
> > Try to define MCACHE_STATS in Objects/typeobject.c and observe the
> > statistics from run to run. It might give some hints.
>
> call_simple only uses regular functions, not methods, so the type
> cache should not have any effect on it. No?

Indeed, sorry for the mistake.

Regards

Antoine.
Re: [Speed] External sources of noise changing call_simple "performance"
by Antoine Pitrou 31 May '16

On Tue, 17 May 2016 23:11:50 +0200
Victor Stinner <victor.stinner(a)gmail.com> wrote:
> Hi,
>
> I'm still (!) investigating the reasons why the benchmark call_simple
> (ok, let's be honest: the *micro*benchmark) gets different results for
> unknown reasons.

Try to define MCACHE_STATS in Objects/typeobject.c and observe the statistics from run to run. It might give some hints.

Regards

Antoine.
Re: [Speed] External sources of noise changing call_simple "performance"
by Victor Stinner 19 May '16

2016-05-17 23:11 GMT+02:00 Victor Stinner <victor.stinner(a)gmail.com>:
> (...)
>
> (*) System load => CPU isolation, disable ASLR, set CPU affinity on
> IRQs, etc. work around this issue --
> http://haypo-notes.readthedocs.io/microbenchmark.html
>
> (...)
>
> (*) Locale, size of the command line and/or the current working
> directory => WTF?!
> (...)
> => My bet is that the locale, current working directory, command line,
> etc. impact how the heap memory is allocated, and this specific
> benchmark depends on the locality of memory allocated on the heap...
> (...)

I tried to find a tool to "randomize" memory allocations, but I failed to find a popular and simple one. I found the following tool, but it seems overkill and not realistic to me: https://emeryberger.com/research/stabilizer/ This tool randomizes everything and "re-randomizes" the code at runtime, every 500 ms. IMHO it's not realistic because PGO+LTO use a specific link order to group "hot code" so that hot functions stay close together.

It seems like enabling ASLR "hides" the effects of the command line, current working directory, environment variables, etc. Using ASLR + statistics (compute mean + standard deviation, use multiple processes to get a better distribution) fixes my issue.

Slowly, I understand better why using the minimum and disabling legit sources of randomness is wrong. I mean that slowly I'm able to explain why :-) It looks like disabling ASLR and focusing on the minimum timing is just wrong. I'm surprised because disabling ASLR is a common practice in benchmarking. For example, on this mailing list, 2 months ago, Alecsandru Patrascu from Intel suggested disabling ASLR: https://mail.python.org/pipermail/speed/2016-February/000289.html (and also disabling Turbo and Hyper-Threading and using a fixed CPU frequency, which is good advice ;-))

By the way, I'm interested to know how the server running speed.python.org is tuned: CPU tuning, OS tuning, etc. For example, Zachary Ware wrote that perf.py was not run with --rigorous when he launched the website.

I will probably write a blog post to explain my issues with benchmarks. Later, I will propose more concrete changes to perf.py and write docs explaining how perf.py should be used (with advice on how to get reliable results).

Victor
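Since the message above hinges on whether ASLR is enabled or disabled, here is a small Linux-specific sketch for checking the current ASLR setting before a benchmark run; the helper name is an assumption for illustration:
---
# Hedged sketch: report the Linux ASLR setting from /proc (Linux only).
from pathlib import Path

ASLR_FILE = Path("/proc/sys/kernel/randomize_va_space")

def aslr_status() -> str:
    # 0 = disabled, 1 = conservative randomization, 2 = full randomization
    value = ASLR_FILE.read_text().strip()
    return {"0": "disabled", "1": "conservative", "2": "full"}.get(value, value)

if __name__ == "__main__":
    print(f"ASLR: {aslr_status()}")
---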
Re: [Speed] CPU speed of one core changes for unknown reason
by Paul Graydon 19 May '16

Bear in mind that what you see by way of CPU speed is based on *sampling*, and the CPU can switch speeds very quickly, far faster than you'd necessarily see in your periodic updates. Also note that if your cooling isn't up to scratch for handling the CPU running permanently at its top normal speed, thermal throttling will cause the system to slow down independently of anything happening OS-side. That's embedded within the chip and can't be disabled.

FWIW, microbenchmarks are inherently unstable and susceptible to jitter on the system side. There are all sorts of things that could be interfering outside the scope of your tests, and because the benchmark is over and done with so quickly, if something does happen it's going to skew the entire benchmark run. If microbenchmarking really is the right thing for your needs, you should look at running enough runs to get a fair idea of realistic performance: think hundreds, then eliminate particularly fast and/or slow runs from your consideration, plus whatever else you might consider for statistical significance.

I do have some concerns that you're increasingly creating a synthetic environment to benchmark against, that you're at risk of optimising towards an environment the code won't actually run in, and that you might even end up pursuing the wrong optimisations.

Paul

On Tue, May 17, 2016 at 11:21:29PM +0200, Victor Stinner wrote:
> According to a friend, my CPU model "Intel(R) Core(TM) i7-2600 CPU @
> 3.40GHz" has a "Turbo Mode" which is enabled by default. The CPU tries
> to use the Turbo Mode whenever possible, but disables it when the CPU is
> too hot. The change should be visible with the exact CPU frequency
> (the change can be a single MHz: 3400 => 3401). I didn't notice such a
> minor CPU frequency change, but I didn't check carefully.
>
> Anyway, I disabled the Turbo Mode and Hyperthreading in the EFI. It
> should avoid the strange performance "drop".
>
> Victor
>
> 2016-05-17 16:44 GMT+02:00 Victor Stinner <victor.stinner(a)gmail.com>:
> > Hi,
> >
> > I'm still having fun with microbenchmarks. I disabled Power States
> > (pstate) of my Intel CPU and forced the frequency to 3.4 GHz. I
> > isolated 2 physical cores out of a total of 4. Timings are very stable
> > *but* sometimes I get an impressive slowdown: like 60% or 80% slower,
> > but only for a short time.
> >
> > Do you know which CPU feature can explain such a temporary slowdown?
> >
> > I tried the cpupower & powertop tools to try to learn more about internal
> > CPU states, but I don't see anything obvious. I also noticed that
> > powertop has a major side effect: it changes the speed of my CPU
> > cores! Since the CPU cores used to run benchmarks are isolated,
> > powertop uses a low speed (like 1.6 GHz, half speed) while benchmarks
> > are running, probably because the kernel doesn't "see" the benchmark
> > processes.
> >
> > My CPU model is: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> >
> > I'm using the "userspace" scaling governor for isolated CPU cores, but
> > "ondemand" for the other CPU cores.
> >
> > I disabled pstate (kernel parameter: intel_pstate=disable); the CPU
> > scaling driver is "acpi-cpufreq".
> >
> > CPUs 2,3,6,7 are isolated.
> >
> > In the following examples, the same microbenchmark takes ~196 ms on
> > all cores, except for core 3 in the first example.
> >
> > Example 1:
> > ---
> > $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0
> > taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py
> > -n 1 --timer perf_counter; done
> > === CPU 0 ===
> > 0.19619656700160704
> > === CPU 1 ===
> > 0.19547197800056892
> > === CPU 2 ===
> > 0.19512042699716403
> > === CPU 3 ===
> > 0.35738898099953076
> > === CPU 4 ===
> > 0.19744606299718725
> > === CPU 5 ===
> > 0.195480646998476
> > === CPU 6 ===
> > 0.19495172200186062
> > === CPU 7 ===
> > 0.19495161599843414
> > ---
> >
> > Example 2:
> > ---
> > $ for cpu in $(seq 0 7); do echo "=== CPU $cpu ==="; PYTHONHASHSEED=0
> > taskset -c $cpu ../fastcall/pgo/python performance/bm_call_simple.py
> > -n 1 --timer perf_counter; done
> > === CPU 0 ===
> > 0.19725238799946965
> > === CPU 1 ===
> > 0.19552089699936914
> > === CPU 2 ===
> > 0.19495758999983082
> > === CPU 3 ===
> > 0.19517506799820694
> > === CPU 4 ===
> > 0.1963375539999106
> > === CPU 5 ===
> > 0.19575440099652042
> > === CPU 6 ===
> > 0.19582506000006106
> > === CPU 7 ===
> > 0.19503543600148987
> > ---
> >
> > If I repeat the same test, timings are always ~196 ms on all cores.
> >
> > It looks like some cores decide to sleep.
> >
> > Victor
> _______________________________________________
> Speed mailing list
> Speed(a)python.org
> https://mail.python.org/mailman/listinfo/speed
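A rough sketch of the approach Paul describes above: collect many runs, drop the extreme tails, and summarize what remains. The workload, the number of runs, and the 10% trim fraction are arbitrary assumptions for illustration:
---
# Hedged sketch: many runs, trim the fastest/slowest tails, report mean/stdev.
import statistics
import timeit

def workload():
    sum(i * i for i in range(1000))   # stand-in for the real benchmark

timings = [timeit.timeit(workload, number=100) for _ in range(200)]

timings.sort()
trim = len(timings) // 10             # drop the 10% fastest and 10% slowest
trimmed = timings[trim:len(timings) - trim]

print(f"mean {statistics.mean(trimmed) * 1000:.3f} ms, "
      f"stdev {statistics.stdev(trimmed) * 1000:.3f} ms "
      f"over {len(trimmed)} of {len(timings)} runs")
---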
Re: [Speed] External sources of noise changing call_simple "performance"
by Victor Stinner 18 May '16

2016-05-18 20:54 GMT+02:00 Maciej Fijalkowski <fijall(a)gmail.com>:
>> Ok. I'm not sure yet that it's feasible to get exactly the same memory
>> addresses for "hot" objects allocated by Python between two versions
>> of the code (...)
>
> Well the answer is to do more statistics really in my opinion. That
> is, perf should report average over multiple runs in multiple
> processes. I started a branch for pypy benchmarks for that, but never
> finished it actually.

I'm not sure that I understood you correctly. As I wrote, running the same benchmark twice using two processes gives exactly the same timing.

I already modified perf.py locally to run multiple processes and focus on the average + std dev rather than the min of a single process. Example: run 10 processes x 3 loops (total: 30)

Run average: 205.4 ms +/- 0.1 ms (min: 205.3 ms, max: 205.4 ms)
Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms)
Run average: 205.2 ms +/- 0.0 ms (min: 205.2 ms, max: 205.3 ms)
Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms)
Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms)
Run average: 205.4 ms +/- 0.1 ms (min: 205.3 ms, max: 205.4 ms)
Run average: 205.3 ms +/- 0.2 ms (min: 205.1 ms, max: 205.4 ms)
Run average: 205.2 ms +/- 0.1 ms (min: 205.1 ms, max: 205.2 ms)
Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms)
Run average: 205.3 ms +/- 0.1 ms (min: 205.2 ms, max: 205.4 ms)
Total average: 205.3 ms +/- 0.1 ms (min: 205.1 ms, max: 205.4 ms)

The "total" concatenates all lists of timings.

Note: Oh, by the way, the timing also depends on the presence of .pyc files ;-) I modified perf.py to add a first run with a single iteration just to rebuild the .pyc files, since the benchmark always starts by removing all .pyc files...

Victor
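A minimal sketch of the "multiple processes, report mean +/- std dev" idea described above, not Victor's actual perf.py patch; the child workload and the 10 x 3 run counts are assumptions that simply mirror the example output:
---
# Hedged sketch: 10 processes x 3 loops, per-run and total mean +/- std dev.
import statistics
import subprocess
import sys

CHILD = """
import timeit
for _ in range(3):                       # 3 loops per process
    print(timeit.timeit('sum(range(10000))', number=1000))
"""

all_timings = []
for _ in range(10):                      # 10 separate processes
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          capture_output=True, text=True, check=True)
    run = [float(line) for line in proc.stdout.split()]
    all_timings.extend(run)
    print(f"Run average: {statistics.mean(run):.4f} s "
          f"+/- {statistics.stdev(run):.4f} s")

print(f"Total average: {statistics.mean(all_timings):.4f} s "
      f"+/- {statistics.stdev(all_timings):.4f} s")
---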
Re: [Speed] External sources of noise changing call_simple "performance"
by Maciej Fijalkowski 18 May '16

On Wed, May 18, 2016 at 1:16 PM, Victor Stinner <victor.stinner(a)gmail.com> wrote:
> 2016-05-18 8:55 GMT+02:00 Maciej Fijalkowski <fijall(a)gmail.com>:
>> I think you misunderstand how caches work. The way caches work depends
>> on the addresses of memory (their value) which even with ASLR disabled
>> can differ between runs. Then you either do or don't have cache
>> collisions.
>
> Ok. I'm not sure yet that it's feasible to get exactly the same memory
> addresses for "hot" objects allocated by Python between two versions
> of the code (especially when testing a small patch). Not only do the
> addresses seem to depend on external parameters, but the patch can
> also add or avoid some memory allocations.
>
> The concrete problem is that the benchmark depends on such low-level
> CPU features and perf.py doesn't ignore a minor delta in performance,
> no?
>
> Victor

Well the answer is to do more statistics, really, in my opinion. That is, perf should report the average over multiple runs in multiple processes. I started a branch for the pypy benchmarks for that, but never finished it.
Re: [Speed] External sources of noise changing call_simple "performance"
by Victor Stinner 18 May '16

2016-05-18 8:55 GMT+02:00 Maciej Fijalkowski <fijall(a)gmail.com>:
> I think you misunderstand how caches work. The way caches work depends
> on the addresses of memory (their value) which even with ASLR disabled
> can differ between runs. Then you either do or don't have cache
> collisions.

Ok. I'm not sure yet that it's feasible to get exactly the same memory addresses for "hot" objects allocated by Python between two versions of the code (especially when testing a small patch). Not only do the addresses seem to depend on external parameters, but the patch can also add or avoid some memory allocations.

The concrete problem is that the benchmark depends on such low-level CPU features and perf.py doesn't ignore a minor delta in performance, no?

Victor
Re: [Speed] External sources of noise changing call_simple "performance"
by Victor Stinner 18 May '16

2016-05-18 10:45 GMT+02:00 Armin Rigo <arigo(a)tunes.org>:
> On 17 May 2016 at 23:11, Victor Stinner <victor.stinner(a)gmail.com> wrote:
>> with PYTHONHASHSEED=1 to test the same hash function. A more generic
>> solution is to use multiple processes to test multiple hash seeds to
>> get a better uniform distribution.
>
> What you say in the rest of the mail just shows that this "generic
> solution" should be applied not only to PYTHONHASHSEED, but also to
> other variables that seem to introduce deterministic noise.

Right. ... or ensure that these other parameters are not changed when testing two versions of the code ;-)

perf.py already starts the process with an empty environment and sets PYTHONHASHSEED: the environment is fixed (constant). I noticed the difference in performance caused by the environment because I failed to reproduce the benchmark (I got different numbers) when I ran the benchmark again manually.

> You've just found three more: the locale, the size of the command line,
> and the working directory. I guess the mere size of the environment also
> plays a role. So I guess, ideally, you'd run a large number of times
> with random values in all these parameters. (In practice it might be
> enough to run a smaller fixed number of times with known values in the
> parameters.)

Right, I have to think about that and try to find a way to randomize these "parameters" (or find a way to make them constant): directories, name of the binary, etc. As I wrote, the environment is easy to control. The working directory and the command line are more complex. It's convenient to be able to pass links to two different Python binaries compiled in two different directories. FYI I'm using a "reference python" compiled in one directory and my "patched python" in a different directory. Both are compiled using the same compiler options (I'm using -O0 for debug, -O3 for a quick benchmark, and -O3 with PGO and LTO for reliable benchmarks).

Another option for microbenchmarks would be to *ignore* (hide) differences smaller than +/- 10%, since this kind of benchmark depends too much on external parameters. I did that in my custom microbenchmark runner; it helps to ignore noise and focus on major speedups (or slowdowns!).

Victor
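A sketch of the "make these parameters constant" idea mentioned above: launch each benchmark process with a minimal fixed environment, a fixed hash seed, and a fixed working directory so those inputs cannot vary between the two Python binaries being compared. The environment values and the demo command are assumptions for illustration:
---
# Hedged sketch: run a benchmark with a constant environment, cwd and hash seed.
import subprocess
import sys

FIXED_ENV = {
    "PATH": "/usr/bin:/bin",     # assumed minimal PATH
    "PYTHONHASHSEED": "0",       # constant hash function across runs
    "LC_ALL": "C",               # constant locale
}

def run_benchmark(python_binary, args):
    # env= replaces (not extends) the inherited environment.
    proc = subprocess.run([python_binary, *args],
                          env=FIXED_ENV, cwd="/tmp",
                          capture_output=True, text=True, check=True)
    return proc.stdout

if __name__ == "__main__":
    # Placeholder workload; substitute the real benchmark script here.
    print(run_benchmark(sys.executable, ["-c", "import sys; print(sys.version)"]))
---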
Re: [Speed] External sources of noise changing call_simple "performance"
by Victor Stinner 18 May '16

2016-05-18 8:55 GMT+02:00 Maciej Fijalkowski <fijall(a)gmail.com>:
> I think you misunderstand how caches work. The way caches work depends
> on the addresses of memory (their value) which even with ASLR disabled
> can differ between runs. Then you either do or don't have cache
> collisions. How about you just accept the fact that there is a
> statistical distribution of the results and not one concrete "right"
> result?

Slowly, I understood that running multiple processes is needed to get a better statistical distribution. Ok.

But I found a very specific case where the result depends on the command line, and the command line is constant. Running the benchmark once or 1 million times doesn't reduce the effect of this parameter, since the effect is constant.

Victor