Benchmarking "fun" (was Re: Python 2.1 slower than 2.0)
In the interest of generating some numbers (and filling up my hard drive), last night I wrote a script to build lots & lots of versions of Python (many of which turned out to be redundant - eg. -O6 didn't seem to do anything different to -O3, and pybench doesn't work with 1.5.2), and then run pybench with them. Summarised results below; first a key:

src-n:        this morning's CVS (with Jeremy's f_localsplus optimisation) (only built this with -O3)
src:          CVS from yesterday afternoon
src-obmalloc: CVS from yesterday afternoon with Vladimir's obmalloc patch applied. More on this later...
Python-2.0:   you can guess what this is.

All runs are compared against Python-2.0-O2:

Benchmark: src-n-O3 (rounds=10, warp=20)
Average round time: 49029.00 ms   -0.86%
Benchmark: src (rounds=10, warp=20)
Average round time: 67141.00 ms  +35.76%
Benchmark: src-O (rounds=10, warp=20)
Average round time: 50167.00 ms   +1.44%
Benchmark: src-O2 (rounds=10, warp=20)
Average round time: 49641.00 ms   +0.37%
Benchmark: src-O3 (rounds=10, warp=20)
Average round time: 49104.00 ms   -0.71%
Benchmark: src-O6 (rounds=10, warp=20)
Average round time: 49131.00 ms   -0.66%
Benchmark: src-obmalloc (rounds=10, warp=20)
Average round time: 63276.00 ms  +27.94%
Benchmark: src-obmalloc-O (rounds=10, warp=20)
Average round time: 46927.00 ms   -5.11%
Benchmark: src-obmalloc-O2 (rounds=10, warp=20)
Average round time: 46146.00 ms   -6.69%
Benchmark: src-obmalloc-O3 (rounds=10, warp=20)
Average round time: 46456.00 ms   -6.07%
Benchmark: src-obmalloc-O6 (rounds=10, warp=20)
Average round time: 46450.00 ms   -6.08%
Benchmark: Python-2.0 (rounds=10, warp=20)
Average round time: 68933.00 ms  +39.38%
Benchmark: Python-2.0-O (rounds=10, warp=20)
Average round time: 49542.00 ms   +0.17%
Benchmark: Python-2.0-O3 (rounds=10, warp=20)
Average round time: 48262.00 ms   -2.41%
Benchmark: Python-2.0-O6 (rounds=10, warp=20)
Average round time: 48273.00 ms   -2.39%

My conclusion? Python 2.1 is slower than Python 2.0, but not by enough to care about.
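For anyone reading along: the percentage column in these pybench summaries is just the relative difference of a run's average round time against the comparison run. A minimal sketch (the helper name and the sample numbers are mine, not pybench's):

```python
def pct_diff(candidate_ms: float, reference_ms: float) -> float:
    """Pybench-style +/- percentage: how much slower (positive) or
    faster (negative) the candidate's average round time is, relative
    to the reference run."""
    return (candidate_ms - reference_ms) / reference_ms * 100.0

# Hypothetical round times: a run 1.5x slower than the reference
# shows up as +50.00% in the summary.
print(f"{pct_diff(150.0, 100.0):+.2f}%")
```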
Interestingly, adding obmalloc speeds things up. Let's take a closer look:

$ python pybench.py -c src-obmalloc-O3 -s src-O3
PYBENCH 0.7

Benchmark: src-O3 (rounds=10, warp=20)

Tests:                             per run     per oper.    diff *
------------------------------------------------------------------------
    BuiltinFunctionCalls:        843.35 ms      6.61 us    +2.93%
     BuiltinMethodLookup:        878.70 ms      1.67 us    +0.56%
           ConcatStrings:       1068.80 ms      7.13 us    -1.22%
           ConcatUnicode:       1373.70 ms      9.16 us    -1.24%
         CreateInstances:       1433.55 ms     34.13 us    +9.06%
 CreateStringsWithConcat:       1031.75 ms      5.16 us   +10.95%
 CreateUnicodeWithConcat:       1277.85 ms      6.39 us    +3.14%
            DictCreation:       1275.80 ms      8.51 us   +44.22%
                ForLoops:       1415.90 ms    141.59 us    -0.64%
              IfThenElse:       1152.70 ms      1.71 us    -0.15%
             ListSlicing:        397.40 ms    113.54 us    -0.53%
          NestedForLoops:        789.75 ms      2.26 us    -0.37%
    NormalClassAttribute:        935.15 ms      1.56 us    -0.41%
 NormalInstanceAttribute:        961.15 ms      1.60 us    -0.60%
     PythonFunctionCalls:       1079.65 ms      6.54 us    -1.00%
       PythonMethodCalls:        908.05 ms     12.11 us    -0.88%
               Recursion:        838.50 ms     67.08 us    -0.00%
            SecondImport:        741.20 ms     29.65 us   +25.57%
     SecondPackageImport:        744.25 ms     29.77 us   +18.66%
   SecondSubmoduleImport:        947.05 ms     37.88 us   +25.60%
 SimpleComplexArithmetic:       1129.40 ms      5.13 us  +114.92%
  SimpleDictManipulation:       1048.55 ms      3.50 us    -0.00%
   SimpleFloatArithmetic:        746.05 ms      1.36 us    -2.75%
SimpleIntFloatArithmetic:        823.35 ms      1.25 us    -0.37%
 SimpleIntegerArithmetic:        823.40 ms      1.25 us    -0.37%
  SimpleListManipulation:       1004.70 ms      3.72 us    +0.01%
    SimpleLongArithmetic:        865.30 ms      5.24 us  +100.65%
              SmallLists:       1657.65 ms      6.50 us    +6.63%
             SmallTuples:       1143.95 ms      4.77 us    +2.90%
   SpecialClassAttribute:        949.00 ms      1.58 us    -0.22%
SpecialInstanceAttribute:       1353.05 ms      2.26 us    -0.73%
          StringMappings:       1161.00 ms      9.21 us    +7.30%
        StringPredicates:       1069.65 ms      3.82 us    -5.30%
           StringSlicing:        846.30 ms      4.84 us    +8.61%
               TryExcept:       1590.40 ms      1.06 us    -0.49%
          TryRaiseExcept:       1104.65 ms     73.64 us   +24.46%
            TupleSlicing:        681.10 ms      6.49 us    -3.13%
         UnicodeMappings:       1021.70 ms     56.76 us    +0.79%
       UnicodePredicates:       1308.45 ms      5.82 us    -4.79%
       UnicodeProperties:       1148.45 ms      5.74 us   +13.67%
          UnicodeSlicing:        984.15 ms      5.62 us    -0.51%
------------------------------------------------------------------------
      Average round time:      49104.00 ms                 +5.70%

*) measured against: src-obmalloc-O3 (rounds=10, warp=20)

Words fail me slightly, but maybe some tuning of the memory allocation of longs & complex numbers would be in order?

Time for lectures - I don't think algebraic geometry is going to make my head hurt as much as trying to explain benchmarks...

Cheers,
M.

--
  ARTHUR:  But which is probably incapable of drinking the coffee.
                   -- The Hitch-Hikers Guide to the Galaxy, Episode 6
Michael Hudson wrote:
In the interest of generating some numbers (and filling up my hard drive), last night I wrote a script to build lots & lots of versions of python (many of which turned out to be redundant - eg. -O6 didn't seem to do anything different to -O3 and pybench doesn't work with 1.5.2), and then run pybench with them. Summarised results below; first a key:
src-n:        this morning's CVS (with Jeremy's f_localsplus optimisation) (only built this with -O3)
src:          CVS from yesterday afternoon
src-obmalloc: CVS from yesterday afternoon with Vladimir's obmalloc patch applied. More on this later...
Python-2.0:   you can guess what this is.
All runs are compared against Python-2.0-O2:
Benchmark: src-n-O3 (rounds=10, warp=20)
Average round time: 49029.00 ms   -0.86%
Benchmark: src (rounds=10, warp=20)
Average round time: 67141.00 ms  +35.76%
Benchmark: src-O (rounds=10, warp=20)
Average round time: 50167.00 ms   +1.44%
Benchmark: src-O2 (rounds=10, warp=20)
Average round time: 49641.00 ms   +0.37%
Benchmark: src-O3 (rounds=10, warp=20)
Average round time: 49104.00 ms   -0.71%
Benchmark: src-O6 (rounds=10, warp=20)
Average round time: 49131.00 ms   -0.66%
Benchmark: src-obmalloc (rounds=10, warp=20)
Average round time: 63276.00 ms  +27.94%
Benchmark: src-obmalloc-O (rounds=10, warp=20)
Average round time: 46927.00 ms   -5.11%
Benchmark: src-obmalloc-O2 (rounds=10, warp=20)
Average round time: 46146.00 ms   -6.69%
Benchmark: src-obmalloc-O3 (rounds=10, warp=20)
Average round time: 46456.00 ms   -6.07%
Benchmark: src-obmalloc-O6 (rounds=10, warp=20)
Average round time: 46450.00 ms   -6.08%
Benchmark: Python-2.0 (rounds=10, warp=20)
Average round time: 68933.00 ms  +39.38%
Benchmark: Python-2.0-O (rounds=10, warp=20)
Average round time: 49542.00 ms   +0.17%
Benchmark: Python-2.0-O3 (rounds=10, warp=20)
Average round time: 48262.00 ms   -2.41%
Benchmark: Python-2.0-O6 (rounds=10, warp=20)
Average round time: 48273.00 ms   -2.39%
My conclusion? Python 2.1 is slower than Python 2.0, but not by enough to care about.
What compiler did you use, and on which platform? I have had similar experiences with -On for n>3 compared to -O2, using pgcc (gcc optimized for PC processors).

BTW, the Linux kernel uses "-Wall -Wstrict-prototypes -O3 -fomit-frame-pointer" as CFLAGS -- perhaps Python should too on Linux?!

Does anybody know about the effect of -fomit-frame-pointer? Would it cause problems or produce code which is not compatible with code compiled without this flag?
Interestingly, adding obmalloc speeds things up. Let's take a closer look:
$ python pybench.py -c src-obmalloc-O3 -s src-O3
PYBENCH 0.7
Benchmark: src-O3 (rounds=10, warp=20)
Tests:                             per run     per oper.    diff *
------------------------------------------------------------------------
    BuiltinFunctionCalls:        843.35 ms      6.61 us    +2.93%
     BuiltinMethodLookup:        878.70 ms      1.67 us    +0.56%
           ConcatStrings:       1068.80 ms      7.13 us    -1.22%
           ConcatUnicode:       1373.70 ms      9.16 us    -1.24%
         CreateInstances:       1433.55 ms     34.13 us    +9.06%
 CreateStringsWithConcat:       1031.75 ms      5.16 us   +10.95%
 CreateUnicodeWithConcat:       1277.85 ms      6.39 us    +3.14%
            DictCreation:       1275.80 ms      8.51 us   +44.22%
                ForLoops:       1415.90 ms    141.59 us    -0.64%
              IfThenElse:       1152.70 ms      1.71 us    -0.15%
             ListSlicing:        397.40 ms    113.54 us    -0.53%
          NestedForLoops:        789.75 ms      2.26 us    -0.37%
    NormalClassAttribute:        935.15 ms      1.56 us    -0.41%
 NormalInstanceAttribute:        961.15 ms      1.60 us    -0.60%
     PythonFunctionCalls:       1079.65 ms      6.54 us    -1.00%
       PythonMethodCalls:        908.05 ms     12.11 us    -0.88%
               Recursion:        838.50 ms     67.08 us    -0.00%
            SecondImport:        741.20 ms     29.65 us   +25.57%
     SecondPackageImport:        744.25 ms     29.77 us   +18.66%
   SecondSubmoduleImport:        947.05 ms     37.88 us   +25.60%
 SimpleComplexArithmetic:       1129.40 ms      5.13 us  +114.92%
  SimpleDictManipulation:       1048.55 ms      3.50 us    -0.00%
   SimpleFloatArithmetic:        746.05 ms      1.36 us    -2.75%
SimpleIntFloatArithmetic:        823.35 ms      1.25 us    -0.37%
 SimpleIntegerArithmetic:        823.40 ms      1.25 us    -0.37%
  SimpleListManipulation:       1004.70 ms      3.72 us    +0.01%
    SimpleLongArithmetic:        865.30 ms      5.24 us  +100.65%
              SmallLists:       1657.65 ms      6.50 us    +6.63%
             SmallTuples:       1143.95 ms      4.77 us    +2.90%
   SpecialClassAttribute:        949.00 ms      1.58 us    -0.22%
SpecialInstanceAttribute:       1353.05 ms      2.26 us    -0.73%
          StringMappings:       1161.00 ms      9.21 us    +7.30%
        StringPredicates:       1069.65 ms      3.82 us    -5.30%
           StringSlicing:        846.30 ms      4.84 us    +8.61%
               TryExcept:       1590.40 ms      1.06 us    -0.49%
          TryRaiseExcept:       1104.65 ms     73.64 us   +24.46%
            TupleSlicing:        681.10 ms      6.49 us    -3.13%
         UnicodeMappings:       1021.70 ms     56.76 us    +0.79%
       UnicodePredicates:       1308.45 ms      5.82 us    -4.79%
       UnicodeProperties:       1148.45 ms      5.74 us   +13.67%
          UnicodeSlicing:        984.15 ms      5.62 us    -0.51%
------------------------------------------------------------------------
      Average round time:      49104.00 ms                 +5.70%
*) measured against: src-obmalloc-O3 (rounds=10, warp=20)
Words fail me slightly, but maybe some tuning of the memory allocation of longs & complex numbers would be in order?
AFAIR, Vladimir's malloc implementation favours small objects. All number objects (except longs) fall into this category. Perhaps we should think about adding his lib to the core?!

--
Marc-Andre Lemburg
______________________________________________________________________
Company:      http://www.egenix.com/
Consulting:   http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
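The small-object idea can be sketched in a few lines. This is only a toy illustration of free-list recycling (the class and its counters are mine, not Vladimir's actual obmalloc code): instead of going to the general-purpose allocator for every small, fixed-size object, dead blocks are kept on a free list and handed back out.

```python
class FreeList:
    """Toy free list: recycle fixed-size blocks instead of asking the
    general-purpose allocator every time (illustration only)."""

    def __init__(self):
        self._free = []          # blocks returned by callers
        self.fresh_allocs = 0    # how often we had to make a new block

    def alloc(self):
        if self._free:
            return self._free.pop()   # fast path: reuse a dead block
        self.fresh_allocs += 1
        return bytearray(16)          # slow path: brand-new 16-byte block

    def free(self, block):
        self._free.append(block)      # don't release; keep for reuse

fl = FreeList()
blocks = [fl.alloc() for _ in range(3)]
for b in blocks:
    fl.free(b)
blocks = [fl.alloc() for _ in range(3)]   # all served from the free list
print(fl.fresh_allocs)                    # -> 3, not 6
```

Programs that churn through many short-lived small objects (ints, floats, frames) hit the fast path almost every time, which is where the speedups in the table above would come from.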
On Wed, Jan 31, 2001 at 03:34:19PM +0100, M.-A. Lemburg wrote:
I have had similar experiences with -On for n>3 compared to -O2, using pgcc (gcc optimized for PC processors). BTW, the Linux kernel uses "-Wall -Wstrict-prototypes -O3 -fomit-frame-pointer" as CFLAGS -- perhaps Python should too on Linux?!
Maybe, but the Linux kernel can be quite specific in what version of gcc you need, and knows in advance on what platform you are using it :) The stability and actual speedup of gcc's optimization options can and does vary across platforms. In the above example, -Wall and -Wstrict-prototypes are just warnings, and -O3 is the same as "-O2 -finline-functions". As for -fomit-frame-pointer....
Does anybody know about the effect of -fomit-frame-pointer ? Would it cause problems or produce code which is not compatible with code compiled without this flag ?
The effect of -fomit-frame-pointer is that the compiler avoids emitting frame-pointer handling code. It doesn't have any effect on compatibility, since it doesn't matter that other parts/functions/libraries do have such code, but it does make debugging impossible (on most machines, in any case.) From GCC's info docs:

`-fomit-frame-pointer'
     Don't keep the frame pointer in a register for functions that
     don't need one.  This avoids the instructions to save, set up and
     restore frame pointers; it also makes an extra register available
     in many functions.  *It also makes debugging impossible on some
     machines.*

     On some machines, such as the Vax, this flag has no effect, because
     the standard calling sequence automatically handles the frame
     pointer and nothing is saved by pretending it doesn't exist.  The
     machine-description macro `FRAME_POINTER_REQUIRED' controls
     whether a target machine supports this flag.  *Note Registers::.

Obviously, for the Linux kernel this is a very good thing; you don't debug the Linux kernel like a normal program anyway (contrary to some other UNIX kernels, I might add.) I believe -g turns off -fomit-frame-pointer itself, but the docs for -g and -fomit-frame-pointer don't mention it.

One other thing I noted in the gcc docs is that gcc doesn't do loop unrolling even with -O3, though I thought it would at -O2. You need to add -funroll-loops to enable loop unrolling, and that might squeeze out some more performance. It only works for loops whose repetition count is known, though, so I'm not sure how much it matters.

--
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Thomas Wouters wrote:
On Wed, Jan 31, 2001 at 03:34:19PM +0100, M.-A. Lemburg wrote:
I have had similar experiences with -On for n>3 compared to -O2, using pgcc (gcc optimized for PC processors). BTW, the Linux kernel uses "-Wall -Wstrict-prototypes -O3 -fomit-frame-pointer" as CFLAGS -- perhaps Python should too on Linux?!
[...lots of useful tips about gcc compiler options...]
Thanks for the useful details, Thomas. I guess on PC machines -fomit-frame-pointer does have some use, due to the restricted number of available registers.

--
Marc-Andre Lemburg
______________________________________________________________________
Company:      http://www.egenix.com/
Consulting:   http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
[M.-A. Lemburg]
AFAIR, Vladimir's malloc implementation favours small objects.
It favors the memory alloc/dealloc patterns Vlad recorded while running an instrumented Python. Which is mostly good news. The flip side is that it favors the specific programs he ran, and who knows whether those are "typical". OTOH, vendor mallocs favor the programs *they* ran, which probably didn't include Python at all <wink>.
... Perhaps we should think about adding his lib to the core ?!
It's patch 101104 on SF. I pushed Vlad to push this for 2.0, but he wisely decided it was too big a change at the time. It's certainly too big a change to slam into 2.1 at this late stage, too.

There are many reasons to want this. For example, list.append() calls realloc() every time today because, despite over-allocating, the list object has no idea how much storage *has* already been allocated; any malloc has to know this info under the covers, but there's no way for us to get at it unless we either add another N bytes to every list object to record it, or use our own malloc which *can* tell us.

list.append()-behavior-varies-wildly-across-platforms-today-
    when-the-list-gets-large-because-of-that-ly y'rs  - tim
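Tim's point can be illustrated with a toy growable array that remembers its own capacity, so most appends skip the "realloc" entirely. This is only a sketch with a made-up growth policy, not CPython's actual list implementation:

```python
class GrowableList:
    """Toy over-allocating list: track capacity alongside length so
    append() only 'reallocs' when the spare room is exhausted."""

    def __init__(self):
        self._items = []
        self._capacity = 0
        self.reallocs = 0      # count of simulated realloc() calls

    def append(self, item):
        if len(self._items) >= self._capacity:
            # Grow by ~25% plus a constant: one plausible policy among many.
            self._capacity += (self._capacity >> 2) + 4
            self.reallocs += 1        # a real realloc() would happen here
        self._items.append(item)      # within capacity: no realloc needed

g = GrowableList()
for i in range(100):
    g.append(i)
print(g.reallocs)   # far fewer than 100 reallocs for 100 appends
```

Without the stored capacity, every single append has to call realloc() and hope the allocator recognises in-place growth; with it, the cost is amortised across many appends, independent of how clever the platform's malloc is.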
In the interest of generating some numbers (and filling up my hard drive), last night I wrote a script to build lots & lots of versions of python (many of which turned out to be redundant - eg. -O6 didn't seem to do anything different to -O3 and pybench doesn't work with 1.5.2), and then run pybench with them.
FYI, I've just updated the archive to also work under Python 1.5.x:

    http://www.lemburg.com/python/pybench-0.7.zip

--
Marc-Andre Lemburg
______________________________________________________________________
Company:      http://www.egenix.com/
Consulting:   http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
participants (4)
- M.-A. Lemburg
- Michael Hudson
- Thomas Wouters
- Tim Peters