On 06.03.16 11:30, Maciej Fijalkowski wrote:
> this is really difficult to read, can you tell me which column am I looking at?
The first column is the search pattern. The second column is the
number of matches found (as a sanity check, it should be the same for
all engines and versions). The third column, under the "re" header, is
the search time in milliseconds. The column under the "str.find" header
is the time of searching for the same pattern without regular
expressions.
PyPy 2.2 is usually significantly faster than CPython 2.7, except when
searching a plain string with a regular expression. But thanks to the
Flexible String Representation, searching a plain string both with and
without a regular expression is faster on CPython 3.6.
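For reference, a tiny sketch of the kind of comparison behind those two
timing columns (the pattern, text and loop count here are made up for
illustration, not taken from the real benchmark script):
---------------------------
# Time searching for a plain substring with and without the re module,
# mirroring the "re" and "str.find" columns described above.
import re
import timeit

text = ("spam eggs ham " * 10000) + "needle" + (" filler" * 100)
pattern = "needle"  # plain string, no regex metacharacters

t_re = timeit.timeit(lambda: re.search(pattern, text), number=1000)
t_find = timeit.timeit(lambda: text.find(pattern), number=1000)

print("re.search: %.1f ms" % (t_re * 1000))
print("str.find:  %.1f ms" % (t_find * 1000))
---------------------------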
Hi,
2016-04-27 20:30 GMT+02:00 Brett Cannon <brett(a)python.org>:
> My first intuition is some cache somewhere is unhappy w/ the varying sizes.
> Have you tried any of this on another machine to see if the results are
> consistent?
On my laptop, the performance when I add deadcode doesn't seem to
change much: the delta is smaller than 1%.
I found a fix for my deadcode issue! Use "make profile-opt" rather
than "make". Using PGO, GCC reorders hot functions to place them closer
together. I also read that it records statistics on branches to emit
the most frequent branch first.
I also modified bm_call_simple.py to use multiple processes and to use
random hash seeds, rather than using a single process and disabling
hash randomization.
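A rough sketch of that approach (not the actual modified script; the
script name and the assumption that it prints a single timing in ms are
mine):
---------------------------
# Run the benchmark in several child processes, each with a random
# PYTHONHASHSEED, and aggregate the timings instead of trusting a
# single process.
import os
import random
import statistics
import subprocess
import sys

NPROC = 15
SCRIPT = "bm_call_simple.py"  # assumed to print one time in ms

times = []
for _ in range(NPROC):
    env = dict(os.environ, PYTHONHASHSEED=str(random.randint(1, 4294967295)))
    out = subprocess.check_output([sys.executable, SCRIPT], env=env,
                                  universal_newlines=True)
    times.append(float(out))

print("Average: %.1f ms +/- %.1f ms (min: %.1f ms, max: %.1f ms)"
      % (statistics.mean(times), statistics.stdev(times),
         min(times), max(times)))
---------------------------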
Comparison reference => fastcall (my whole fork, not just the tiny
patches adding deadcode) using make (gcc -O3):
Average: 1183.5 ms +/- 6.1 ms (min: 1173.3 ms, max: 1201.9 ms) - 15 processes x 5 loops
=> Average: 1121.2 ms +/- 7.4 ms (min: 1106.5 ms, max: 1142.0 ms) - 15 processes x 5 loops
Comparison reference => fastcall using make profile-opt (PGO):
Average: 962.7 ms +/- 17.8 ms (min: 952.6 ms, max: 998.6 ms) - 15 processes x 5 loops
=> Average: 961.1 ms +/- 18.6 ms (min: 949.0 ms, max: 1011.3 ms) - 15 processes x 5 loops
Using make, fastcall *seems* to be faster, but in fact it looks more
like the random noise caused by dead code. Using PGO, fastcall doesn't
change performance at all. I expected fastcall to be faster, but that's
the purpose of benchmarks: to measure real performance, not
expectations :-)
Next step: modify most benchmarks in perf.py to run multiple processes
rather than a single process, to test with multiple hash seeds.
Victor
My first intuition is some cache somewhere is unhappy w/ the varying sizes.
Have you tried any of this on another machine to see if the results are
consistent?
> For more benchmarks, see attached deadcode1.log and deadcode2.log:
> results of the CPython benchmark suite comparing deadcode1 vs reference
> and deadcode2 vs reference, run on my desktop PC (perf.py --fast & CPU
> isolation). Again, deadcode1 looks slower in most cases and deadcode2
> looks faster in most cases, even though the only difference is still
> dead code...
Sorry, I forgot to attach these two files. They are now attached to
this new email.
Victor
Hi,
I'm working on an experimental change to CPython introducing a new
"fast call" calling convention for Python and C functions. It passes an
array of PyObject* and the number of arguments as a C int (PyObject
**stack, int nargs) instead of using a temporary tuple (PyObject
*args). The expectation is that avoiding the tuple creation makes
Python faster.
http://bugs.python.org/issue26814
First microbenchmarks on optimized code are promising: between 18% and
44% faster.
http://bugs.python.org/issue26814#msg263999
http://bugs.python.org/issue26814#msg264003
But I was quickly blocked on "macrobenchmarks" (?): running the Python
benchmark suite shows that many benchmarks are between 2% and 15%
slower. I spent hours (days) investigating the issue using Cachegrind,
Callgrind, Linux perf, strace, ltrace, etc., but I was unable to
understand how my change can make CPython slower.
My change is quite big: "34 files changed, 3301 insertions(+), 730
deletions(-)". In fact, the performance regression can be reproduced
easily with a few lines of C code: see attached patches. You only have
to add some *unused* (dead) code to see a "glitch" in performance.
It's even worse: the performance change depends on the size of unused
code.
I did my best to isolate the microbenchmark to make it as reliable as
possible. Results of bm_call_simple on my desktop PC:
(a) Reference:
Average: 1201.0 ms +/- 0.2 ms (min: 1200.7 ms, max: 1201.2 ms)
(b) Add 2 unused functions, based on (a):
Average: 1273.0 ms +/- 1.8 ms (min: 1270.1 ms, max: 1274.4 ms)
(c) Add 1 unused short function ("return NULL;"), based on (a):
Average: 1169.6 ms +/- 0.2 ms (min: 1169.3 ms, max: 1169.8 ms)
(b) and (c) are 2 versions only adding unused code to (a). The
difference between (b) and (c) is the size of unused code. The problem
is that (b) makes the code slower and (c) makes the code faster (!),
whereas I would not expect any performance change.
A sane person should ignore such a minor performance delta (+72 ms =
+6% // -31.4 ms = -3%). Right. But for optimization patches on CPython,
we use the CPython benchmark suite as proof that yeah, the change
really makes CPython faster, as announced.
I compiled the C code with GCC (5.3) and Clang (3.7) using various
options: -O0, -O3, -fno-align-functions, -falign-functions=N (with
N=1, 2, 6, 12), -fomit-frame-pointer, -flto, etc. In short, the
performance looks "random". I'm unable to correlate the performance
with any Linux perf event. IMHO the performance depends on something
low level like the L1 cache, the CPU pipeline, branch prediction, etc.
As I wrote, I'm unable to verify that.
To reproduce my issue, you can use the following commands:
---------------------------
hg clone https://hg.python.org/cpython fastcall
# or: "hg clone (...)/cpython fastcall"
# if you already have a local copy of cpython ;-)
cd fastcall
./configure -C
# build reference binary
hg up -C -r 496e094f4734
patch -p1 < prepare.patch
make && mv python python-ref
# build binary with deadcode 1
hg up -C -r 496e094f4734
patch -p1 < prepare.patch
patch -p1 < deadcode1.patch
make && mv python python-deadcode1
# build binary with deadcode 2
hg up -C -r 496e094f4734
patch -p1 < prepare.patch
patch -p1 < deadcode2.patch
make && mv python python-deadcode2
# run benchmark
PYTHONHASHSEED=0 ./python-ref bm_call_simple.py
PYTHONHASHSEED=0 ./python-deadcode1 bm_call_simple.py
PYTHONHASHSEED=0 ./python-deadcode2 bm_call_simple.py
---------------------------
I suggest you isolate at least one CPU and run the benchmark on the
isolated CPUs to get reliable timings:
---------------------------
# run benchmark on the CPU #2
PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
---------------------------
My notes on CPU isolation:
http://haypo-notes.readthedocs.org/microbenchmark.html
If you don't want to try CPU isolation, try to get an idle system
and/or run the benchmark many times until the standard deviation (the
"+/- ..." part) looks small enough...
Don't try to run the microbenchmark without PYTHONHASHSEED=0 or you
will get random results depending on the secret hash key used by the
randomized hash function. (Or modify the code to spawn enough child
processes to get a uniform distribution ;-))
I don't expect you to get the same numbers as me. For example, on
my laptop, the delta is very small (+/- 1%):
$ PYTHONHASHSEED=0 taskset -c 2 ./python-ref bm_call_simple.py
Average: 1096.1 ms +/- 12.9 ms (min: 1079.5 ms, max: 1110.3 ms)
$ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode1 bm_call_simple.py
Average: 1109.2 ms +/- 11.1 ms (min: 1095.8 ms, max: 1122.9 ms)
=> +1% (+13 ms)
$ PYTHONHASHSEED=0 taskset -c 2 ./python-deadcode2 bm_call_simple.py
Average: 1072.0 ms +/- 1.5 ms (min: 1070.0 ms, max: 1073.9 ms)
=> -2% (-24 ms)
CPU of my desktop: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz - 4
physical cores with hyper-threading
CPU of my laptop: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz - 2
physical cores with hyper-threading
I modified bm_call_simple.py to call foo() 100 times rather than 20 in
the loop to see the issue more easily. I also removed dependencies and
changed the output format to display average, standard deviation,
minimum and maximum.
For more benchmarks, see attached deadcode1.log and deadcode2.log:
results of the CPython benchmark suite comparing deadcode1 vs reference
and deadcode2 vs reference, run on my desktop PC (perf.py --fast & CPU
isolation). Again, deadcode1 looks slower in most cases and deadcode2
looks faster in most cases, even though the only difference is still
dead code...
Victor, disappointed
On Tue, 26 Apr 2016 18:28:32 +0200
Maciej Fijalkowski <fijall(a)gmail.com>
wrote:
>
> taking the minimum is a terrible idea anyway, none of the statistical
> discussion makes sense if you do that
The minimum is a reasonable metric for the quick throwaway benchmarks
that timeit is designed for, as it has a better chance of alleviating
the impact of system load (such throwaway benchmarks are often run on
the developer's workstation).
For a persistent benchmark suite, where we can afford longer benchmark
runtimes and are able to keep system noise to a minimum, we might
prefer another metric.
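A toy illustration of that reasoning (all numbers invented, assuming
the load-induced noise is purely additive):
---------------------------
# With additive noise, the minimum stays close to the "true" cost while
# the mean drifts upward with the system load.
import random
import statistics

true_cost = 100.0  # hypothetical real runtime in ms
samples = [true_cost + abs(random.gauss(0, 20)) for _ in range(20)]

print("min:  %.1f ms" % min(samples))
print("mean: %.1f ms +/- %.1f ms"
      % (statistics.mean(samples), statistics.stdev(samples)))
---------------------------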
Regards
Antoine.
On Tue, Apr 26, 2016 at 11:46 AM, Victor Stinner
<victor.stinner(a)gmail.com> wrote:
> Hi,
>
> 2016-04-26 10:56 GMT+02:00 Armin Rigo <arigo(a)tunes.org>:
>> Hi,
>>
>> On 25 April 2016 at 08:25, Maciej Fijalkowski <fijall(a)gmail.com> wrote:
>>> The problem with disabled ASLR is that you change the measurement from
>>> a statistical distribution, to one draw from a statistical
>>> distribution repeatedly. There is no going around doing multiple runs
>>> and doing an average on that.
>>
>> You should mention that it is usually enough to do the following:
>> instead of running once with PYTHONHASHSEED=0, run five or ten times
>> with PYTHONHASHSEED in range(5 or 10). In this way, you get all
>> benefits: not-too-long benchmarking, no randomness, but still some
>> statistically relevant sampling.
>
> I guess that the number of runs required to get a nice distribution
> depends on the size of the largest dictionary in the benchmark. I
> mean, the dictionaries that matter for performance.
>
> The best would be to handle this transparently in perf.py: either
> disable all sources of randomness, or run multiple processes to get a
> uniform distribution, rather than only having one sample for one
> specific config. Maybe it could be an option: by default, run multiple
> processes, but have an option to only run one process using
> PYTHONHASHSEED=0.
>
> By the way, timeit has a very similar issue. I'm quite sure that most
> Python developers run "python -m timeit ..." at least 3 times and take
> the minimum. "python -m timeit" could maybe be modified to also spawn
> child processes to get a better distribution, and maybe also modified
> to display the minimum, the average and the standard deviation? (not
> only the minimum)
taking the minimum is a terrible idea anyway, none of the statistical
discussion makes sense if you do that
>
> Well, the question is also whether it's a good thing to have such a
> really tiny microbenchmark like bm_call_simple in the Python benchmark
> suite. I spent 2 or 3 days analyzing CPython running bm_call_simple
> with the Linux perf tool, callgrind and cachegrind. I'm still unable
> to understand the link between my changes to the C code and the result.
> IMHO this specific benchmark depends on very low-level things like the
> CPU L1 cache. Maybe bm_call_simple helps in some very specific use
> cases, like trying to make Python function calls faster. But in other
> cases, it can be a source of noise, confusion and frustration...
>
> Victor
maybe it's just a terrible benchmark (it surely is for pypy for example)
Hi,
2016-04-26 11:01 GMT+02:00 Antonio Cuni <anto.cuni(a)gmail.com>:
> On Mon, Apr 25, 2016 at 12:49 AM, Victor Stinner <victor.stinner(a)gmail.com>
> wrote:
>> Last months, I spent a lot of time on microbenchmarks. Probably too
>> much time :-) I found a great Linux config to get a much more stable
>> system to get reliable microbenchmarks:
>> https://haypo-notes.readthedocs.org/microbenchmark.html
>>
>> * isolate some CPU cores
>
> you might be interested in cpusets and the cset utility: in theory, they
> allow you to isolate one CPU without having to reboot to change the kernel
> parameters:
>
> http://skebanga.blogspot.it/2012/06/cset-shield-easily-configure-cpusets....
> https://github.com/lpechacek/cpuset
Ah, I didn't know this tool. Basically, it looks similar to the Linux
isolcpus command line parameter, but done in userspace. I see an
advantage: it can be used temporarily, without having to reboot with
new kernel parameters.
> However, I never did a scientific comparison between cpusets and isolcpus to
> see if the former behaves exactly like the latter.
I have a simple test:
* run a benchmark when the system is idle
* run a benchmark when the system is *very* busy (ex: system load > 5)
Using CPU isolation + nohz_full + blocking IRQs on the isolated CPUs,
the benchmark result is the *same* in both cases. Try it on a Linux
system without any specific config to see a huge difference: for
example, performance divided by two.
I'm using CPU isolation to be able to run benchmarks while I'm still
working on my PC: using firefox, thunderbird, running heavy unit tests,
compiling C code, etc.
Right now, I dedicated 2 physical cores to benchmarks and kept 2
physical cores for regular work. Maybe it's too much. It looks like
almost all benchmarks only use one logical core in practice (whereas 2
physical cores give me 4 logical cores). Next time I will probably
only dedicate 1 physical core. The advantage of having two dedicated
physical cores is being able to run two "isolated" benchmarks in
parallel ;-)
I wrote a simple tool to keep the system load above a given minimum:
https://bitbucket.org/haypo/misc/src/tip/bin/system_load.py
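It is roughly equivalent to something like this (a sketch, not the
actual script; the target load is just an example):
---------------------------
# Spawn busy worker processes until the 1-minute load average stays
# above a target value.
import multiprocessing
import os
import time

TARGET_LOAD = 5.0  # example minimum system load to maintain

def burn():
    while True:
        pass  # busy loop, just to generate CPU load

if __name__ == "__main__":
    workers = []
    try:
        while True:
            load1, _, _ = os.getloadavg()
            if load1 < TARGET_LOAD and len(workers) < multiprocessing.cpu_count():
                proc = multiprocessing.Process(target=burn, daemon=True)
                proc.start()
                workers.append(proc)
            time.sleep(5)
    except KeyboardInterrupt:
        for proc in workers:
            proc.terminate()
---------------------------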
I also started to write a script to configure a system for CPU
isolation (a rough sketch of these steps follows below):
https://bitbucket.org/haypo/misc/src/tip/bin/isolcpus.py
* Block IRQs on isolated CPU cores
* Disable ASLR
* Force the "performance" CPU speed on isolated cores, but not on other
cores. I don't want to burn my PC :-) Intel P-state is still enabled
on all CPU cores, so the power state of isolated cores still changes
dynamically in practice. You can see it using powertop for example.
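Very roughly, the script boils down to writes like these (a sketch
only: it must run as root, the core numbers are examples, and the IRQ
mask assumes fewer than 32 CPUs):
---------------------------
# The kind of knobs a CPU isolation script touches (run as root).
import os

ISOLATED = [2, 3]  # example isolated CPU cores

def write(path, value):
    with open(path, "w") as fp:
        fp.write(value)

# Disable ASLR system-wide (0 = no randomization).
write("/proc/sys/kernel/randomize_va_space", "0")

# Force the "performance" cpufreq governor on the isolated cores only.
for cpu in ISOLATED:
    write("/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor" % cpu,
          "performance")

# Keep newly registered IRQs away from the isolated cores by masking
# them out of the default IRQ affinity.
mask = 0
for cpu in range(os.cpu_count()):
    if cpu not in ISOLATED:
        mask |= 1 << cpu
write("/proc/irq/default_smp_affinity", "%x" % mask)
---------------------------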
CPU isolation is not perfect; you still have random sources of noise.
There are also System Management Interrupts (SMI) and other low-level
things. I hope that running multiple iterations of the benchmark is
enough to reduce (or remove) the other sources of noise.
By the way, search "Linux realtime" to find good information about
"sources of noise" on Linux. Example:
https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application#Hardware
Hopefully, my timing requirements are much softer than hard realtime ;-)
Victor
Hi,
2016-04-26 10:56 GMT+02:00 Armin Rigo <arigo(a)tunes.org>:
> Hi,
>
> On 25 April 2016 at 08:25, Maciej Fijalkowski <fijall(a)gmail.com> wrote:
>> The problem with disabled ASLR is that you change the measurement from
>> a statistical distribution, to one draw from a statistical
>> distribution repeatedly. There is no going around doing multiple runs
>> and doing an average on that.
>
> You should mention that it is usually enough to do the following:
> instead of running once with PYTHONHASHSEED=0, run five or ten times
> with PYTHONHASHSEED in range(5 or 10). In this way, you get all
> benefits: not-too-long benchmarking, no randomness, but still some
> statistically relevant sampling.
I guess that the number of runs required to get a nice distribution
depends on the size of the largest dictionary in the benchmark. I
mean, the dictionaries that matter for performance.
The best would be to handle this transparently in perf.py: either
disable all sources of randomness, or run multiple processes to get a
uniform distribution, rather than only having one sample for one
specific config. Maybe it could be an option: by default, run multiple
processes, but have an option to only run one process using
PYTHONHASHSEED=0.
By the way, timeit has a very similar issue. I'm quite sure that most
Python developers run "python -m timeit ..." at least 3 times and take
the minimum. "python -m timeit" could maybe be modified to also spawn
child processes to get a better distribution, and maybe also modified
to display the minimum, the average and the standard deviation? (not
only the minimum)
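A sketch of what that could look like, as an external wrapper rather
than an actual patch to timeit (the statement and run counts are just
examples):
---------------------------
# Run the same statement in several child interpreters (each child gets
# its own random hash seed since PYTHONHASHSEED is left unset) and
# report the minimum, the average and the standard deviation.
import statistics
import subprocess
import sys

STMT = "sum(range(1000))"  # example statement to time
RUNS = 5

results = []
for _ in range(RUNS):
    child_code = ("import timeit; "
                  "print(min(timeit.repeat(%r, number=10000, repeat=3)))"
                  % STMT)
    out = subprocess.check_output([sys.executable, "-c", child_code],
                                  universal_newlines=True)
    results.append(float(out))

print("min: %.6f s  avg: %.6f s  stdev: %.6f s"
      % (min(results), statistics.mean(results), statistics.stdev(results)))
---------------------------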
Well, the question is also whether it's a good thing to have such a
really tiny microbenchmark like bm_call_simple in the Python benchmark
suite. I spent 2 or 3 days analyzing CPython running bm_call_simple
with the Linux perf tool, callgrind and cachegrind. I'm still unable to
understand the link between my changes to the C code and the result.
IMHO this specific benchmark depends on very low-level things like the
CPU L1 cache. Maybe bm_call_simple helps in some very specific use
cases, like trying to make Python function calls faster. But in other
cases, it can be a source of noise, confusion and frustration...
Victor
Hi Armin,
On Tue, Apr 26, 2016 at 10:56 AM, Armin Rigo <arigo(a)tunes.org> wrote:
> Hi,
>
> On 25 April 2016 at 08:25, Maciej Fijalkowski <fijall(a)gmail.com> wrote:
> > The problem with disabled ASLR is that you change the measurement from
> > a statistical distribution, to one draw from a statistical
> > distribution repeatedly. There is no going around doing multiple runs
> > and doing an average on that.
>
> You should mention that it is usually enough to do the following:
> instead of running once with PYTHONHASHSEED=0, run five or ten times
> with PYTHONHASHSEED in range(5 or 10). In this way, you get all
> benefits: not-too-long benchmarking, no randomness, but still some
> statistically relevant sampling.
>
note that here there are two sources of "randomness": one is
PYTHONHASHSEED (which you can control with the env variable), the other
is ASLR which, AFAIK, you cannot control in the same fine-grained way:
you can only enable or disable it.
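For completeness, the usual Linux knobs look like this (a sketch: the
procfs path and the setarch option are standard, but the benchmark
command is just an example):
---------------------------
# Two ways to control ASLR on Linux: the system-wide procfs switch, and
# per-process disabling via "setarch --addr-no-randomize" (util-linux).
import platform
import subprocess
import sys

# System-wide: 0 = ASLR off, 2 = full randomization (the usual default).
with open("/proc/sys/kernel/randomize_va_space") as fp:
    print("randomize_va_space =", fp.read().strip())

# Per-process: run one benchmark with ASLR disabled only for that command.
subprocess.check_call(["setarch", platform.machine(), "--addr-no-randomize",
                       sys.executable, "bm_call_simple.py"])
---------------------------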