Re: Python 2.1 slower than 2.0
On Sat, 27 Jan 2001 18:28:02 +0100, Andreas Jung <andreas@andreas-jung.com> wrote:
Is there a reason why 2.1 runs significantly slower? Both Python versions were compiled with -g -O2 only.
[CC'ing to python-dev] Confirmed:

[amk@mira Python-2.0]$ ./python Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.14
This machine benchmarks at 3184.71 pystones/second
[amk@mira Python-2.0]$ python2.1 Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.81
This machine benchmarks at 2624.67 pystones/second

The ceval.c changes seem a likely candidate to have caused this. Anyone want to run Marc-Andre's microbenchmarks and see how the numbers have changed?

--amk
[A.M. Kuchling]
[CC'ing to python-dev] Confirmed:
[amk@mira Python-2.0]$ ./python Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.14
This machine benchmarks at 3184.71 pystones/second
[amk@mira Python-2.0]$ python2.1 Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.81
This machine benchmarks at 2624.67 pystones/second
The ceval.c changes seem a likely candidate to have caused this. Anyone want to run Marc-Andre's microbenchmarks and see how the numbers have changed?
Want to, yes, but it looks hopeless on my box:

**** 2.0
C:\Python20>python lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 0.851013
This machine benchmarks at 11750.7 pystones/second
C:\Python20>python lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.24279
This machine benchmarks at 8046.41 pystones/second

**** 2.1a1
C:\Python21a1>python lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 0.823313
This machine benchmarks at 12146 pystones/second
C:\Python21a1>python lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.27046
This machine benchmarks at 7871.15 pystones/second

**** CVS
C:\Code\python\dist\src\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.836391
This machine benchmarks at 11956.1 pystones/second
C:\Code\python\dist\src\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 1.3055
This machine benchmarks at 7659.9 pystones/second

That's after a reboot: no matter which Python I use, it gets about 12000 on the first run with a given python.exe, and about 8000 on the second. Not shown is that it *stays* at about 8000 until the next reboot. So there's a Windows (W98SE) Mystery, but also no evidence that timings have changed worth spit under the MS compiler.

The eval loop is very touchy, and I suspect you won't track this down on your box until you stare at the code gcc (I presume you're using gcc) generates. May be sensitive to which release of gcc you're using too.

switch-to-windows-and-you'll-have-easier-things-to-worry-about<wink>-ly y'rs - tim
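Tim's first-run/second-run anomaly is the kind of thing a small repeat-timing harness makes visible. A minimal sketch (the `measure` harness and `workload` stand-in are illustrative assumptions, not the pystone script itself):

```python
import time

def measure(func, repeats=3):
    # Time several consecutive runs of func with a wall-clock timer, so
    # that first-run vs. later-run effects (like the W98SE mystery above)
    # show up as differences between elements of the result list.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings

def workload():
    # Stand-in for pystone: a tight loop of integer arithmetic.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

for n, t in enumerate(measure(workload), 1):
    print("run %d: %.4f seconds" % (n, t))
```

Comparing the first element of the list against the rest is exactly the check Tim performed by hand; a large gap that persists across interpreters points at the machine, not the Python build.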
Here are the results of my micro benchmark pybench 0.7:

PYBENCH 0.7
Benchmark: /home/lemburg/tmp/pybench-2.1a1.pyb (rounds=10, warp=20)

Tests:                        per run    per oper.    diff *
------------------------------------------------------------------------
BuiltinFunctionCalls:      1102.30 ms    8.65 us   +7.56%
BuiltinMethodLookup:        966.75 ms    1.84 us   +4.56%
ConcatStrings:             1198.55 ms    7.99 us  +11.63%
ConcatUnicode:             1835.60 ms   12.24 us  +19.29%
CreateInstances:           1556.40 ms   37.06 us   +2.49%
CreateStringsWithConcat:   1396.70 ms    6.98 us   +5.44%
CreateUnicodeWithConcat:   1895.80 ms    9.48 us  +31.61%
DictCreation:              1760.50 ms   11.74 us   +2.43%
ForLoops:                  1426.90 ms  142.69 us   -7.51%
IfThenElse:                1155.25 ms    1.71 us   -6.24%
ListSlicing:                555.40 ms  158.69 us   -4.14%
NestedForLoops:             784.55 ms    2.24 us   -6.33%
NormalClassAttribute:      1052.80 ms    1.75 us  -10.42%
NormalInstanceAttribute:   1053.80 ms    1.76 us   +0.89%
PythonFunctionCalls:       1127.50 ms    6.83 us  +12.56%
PythonMethodCalls:          909.10 ms   12.12 us   +9.70%
Recursion:                  942.40 ms   75.39 us  +23.74%
SecondImport:               924.20 ms   36.97 us   +3.98%
SecondPackageImport:        951.10 ms   38.04 us   +6.16%
SecondSubmoduleImport:     1211.30 ms   48.45 us   +7.69%
SimpleComplexArithmetic:   1635.30 ms    7.43 us   +5.58%
SimpleDictManipulation:     963.35 ms    3.21 us   -0.57%
SimpleFloatArithmetic:      877.00 ms    1.59 us   -2.92%
SimpleIntFloatArithmetic:   851.10 ms    1.29 us   -5.89%
SimpleIntegerArithmetic:    850.05 ms    1.29 us   -6.41%
SimpleListManipulation:    1168.50 ms    4.33 us   +8.14%
SimpleLongArithmetic:      1231.15 ms    7.46 us   +1.52%
SmallLists:                2153.35 ms    8.44 us  +10.77%
SmallTuples:               1314.65 ms    5.48 us   +3.80%
SpecialClassAttribute:     1050.80 ms    1.75 us   +1.48%
SpecialInstanceAttribute:  1248.75 ms    2.08 us   -2.32%
StringMappings:            1702.60 ms   13.51 us  +19.69%
StringPredicates:          1024.25 ms    3.66 us  -25.49%
StringSlicing:             1093.35 ms    6.25 us   +4.35%
TryExcept:                 1584.85 ms    1.06 us  -10.90%
TryRaiseExcept:            1239.50 ms   82.63 us   +4.64%
TupleSlicing:               983.00 ms    9.36 us   +3.36%
UnicodeMappings:           1631.65 ms   90.65 us  +42.76%
UnicodePredicates:         1762.10 ms    7.83 us  +15.99%
UnicodeProperties:         1410.80 ms    7.05 us  +19.57%
UnicodeSlicing:            1366.20 ms    7.81 us  +19.23%
------------------------------------------------------------------------
Average round time:       58001.00 ms              +3.30%

*) measured against: /home/lemburg/tmp/pybench-2.0.pyb (rounds=10, warp=20)

The benchmark is available here in case someone wants to verify the results on different platforms: http://www.lemburg.com/python/pybench-0.7.zip

The above tests were done on a Linux 2.2 system, AMD K6 233MHz. The figures shown compare CVS Python (2.1a1) against stock Python 2.0.

As you can see, Python function calls have suffered a lot for some reason. Unicode mappings and other Unicode database related methods show the effect of the compression of the Unicode database -- a clear space/speed tradeoff. I can't really explain why Unicode concatenation has had a slowdown -- perhaps the new coercion logic has something to do with this ?!

On the nice side: attribute lookups are faster, probably due to the string key optimizations in the dictionary implementation. Loops and exceptions are also a tad faster.

-- Marc-Andre Lemburg
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
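The "per run" and "per oper." columns above come from dividing a test's total time by the number of rounds and by the number of inner-loop operations. A rough sketch of that idea using the stdlib `timeit` module (this is an assumption about the approach, not pybench's actual code; the `concat_strings` workload is hypothetical):

```python
import timeit

def concat_strings():
    # One "run": 100 string concatenations, pybench-style.
    s = ""
    for _ in range(100):
        s = s + "x"
    return s

runs = 1000
total = timeit.timeit(concat_strings, number=runs)  # total seconds for all runs
per_run_ms = total / runs * 1000                    # milliseconds per run
per_op_us = total / (runs * 100) * 1e6              # microseconds per concat
print("%.2f ms per run, %.2f us per operation" % (per_run_ms, per_op_us))
```

The "diff *" column in pybench is then just the relative change of these per-run figures against a saved baseline (`.pyb`) file from the other interpreter.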
mal wrote:
UnicodeMappings: 1631.65 ms 90.65 us +42.76% UnicodePredicates: 1762.10 ms 7.83 us +15.99% UnicodeProperties: 1410.80 ms 7.05 us +19.57% UnicodeSlicing: 1366.20 ms 7.81 us +19.23%
Unicode mappings and other Unicode database related methods show the effect of the compression of the Unicode database -- a clear space/speed tradeoff.
umm. the tests don't seem to test the "\N{name}" escapes, so the only thing that has changed in 2.1 is the "decomposition" method (used in the UnicodeProperties test). are you sure you're comparing against 2.0 final? Cheers /F
Fredrik Lundh wrote:
mal wrote:
UnicodeMappings: 1631.65 ms 90.65 us +42.76% UnicodePredicates: 1762.10 ms 7.83 us +15.99% UnicodeProperties: 1410.80 ms 7.05 us +19.57% UnicodeSlicing: 1366.20 ms 7.81 us +19.23%
Unicode mappings and other Unicode database related methods show the effect of the compression of the Unicode database -- a clear space/speed tradeoff.
umm. the tests don't seem to test the "\N{name}" escapes, so the only thing that has changed in 2.1 is the "decomposition" method (used in the UnicodeProperties test).
The mappings figure surprised me too: the code has not changed, but the unicodetype_db.h looks different. Don't know how this affects performance though.

The differences could also be explained by an increase in Unicode object creation time (the concatenation is also a lot slower), so perhaps that's where we should look...
are you sure you're comparing against 2.0 final?
Yes... after a check of the Makefile I found that I had compiled Python 2.0 with -O3 and 2.1a1 with -O2 -- perhaps this makes a difference w/r to inlining of code. I'll recompile and rerun the benchmark.

-- Marc-Andre Lemburg
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
Fredrik Lundh wrote:
mal wrote:
UnicodeMappings: 1631.65 ms 90.65 us +42.76% UnicodePredicates: 1762.10 ms 7.83 us +15.99% UnicodeProperties: 1410.80 ms 7.05 us +19.57% UnicodeSlicing: 1366.20 ms 7.81 us +19.23%
Unicode mappings and other Unicode database related methods show the effect of the compression of the Unicode database -- a clear space/speed tradeoff.
umm. the tests don't seem to test the "\N{name}" escapes, so the only thing that has changed in 2.1 is the "decomposition" method (used in the UnicodeProperties test).
The mappings figure surprised me too: the code has not changed, but the unicodetype_db.h looks different. Don't know how this affects performance though.

The differences could also be explained by an increase in Unicode object creation time (the concatenation is also a lot slower), so perhaps that's where we should look...
are you sure you're comparing against 2.0 final?
Yes... after a check of the Makefile I found that I had compiled Python 2.0 with -O3 and 2.1a1 with -O2 -- perhaps this makes a difference w/r to inlining of code. I'll recompile and rerun the benchmark.
Looks like there is an effect of choosing -O3 over -O2 (even though not necessarily positive all the way); what results do you get on Windows?

PYBENCH 0.7
Benchmark: /home/lemburg/tmp/pybench-2.1a1.pyb (rounds=10, warp=20)

Tests:                        per run    per oper.    diff *
------------------------------------------------------------------------
BuiltinFunctionCalls:      1065.10 ms    8.35 us   +3.93%
BuiltinMethodLookup:       1286.30 ms    2.45 us  +39.12%
ConcatStrings:             1243.30 ms    8.29 us  +15.80%
ConcatUnicode:             1449.10 ms    9.66 us   -5.83%
CreateInstances:           1639.25 ms   39.03 us   +7.95%
CreateStringsWithConcat:   1453.45 ms    7.27 us   +9.73%
CreateUnicodeWithConcat:   1558.45 ms    7.79 us   +8.19%
DictCreation:              1869.35 ms   12.46 us   +8.77%
ForLoops:                  1526.85 ms  152.69 us   -1.03%
IfThenElse:                1381.00 ms    2.05 us  +12.09%
ListSlicing:                547.40 ms  156.40 us   -5.52%
NestedForLoops:             824.50 ms    2.36 us   -1.56%
NormalClassAttribute:      1233.55 ms    2.06 us   +4.96%
NormalInstanceAttribute:   1215.50 ms    2.03 us  +16.37%
PythonFunctionCalls:       1107.30 ms    6.71 us  +10.55%
PythonMethodCalls:         1047.00 ms   13.96 us  +26.34%
Recursion:                  940.35 ms   75.23 us  +23.47%
SecondImport:               894.05 ms   35.76 us   +0.59%
SecondPackageImport:        915.05 ms   36.60 us   +2.14%
SecondSubmoduleImport:     1131.10 ms   45.24 us   +0.56%
SimpleComplexArithmetic:   1652.05 ms    7.51 us   +6.67%
SimpleDictManipulation:    1150.25 ms    3.83 us  +18.72%
SimpleFloatArithmetic:      889.65 ms    1.62 us   -1.52%
SimpleIntFloatArithmetic:   900.80 ms    1.36 us   -0.40%
SimpleIntegerArithmetic:    901.75 ms    1.37 us   -0.72%
SimpleListManipulation:    1125.40 ms    4.17 us   +4.15%
SimpleLongArithmetic:      1305.15 ms    7.91 us   +7.62%
SmallLists:                2102.85 ms    8.25 us   +8.18%
SmallTuples:               1329.55 ms    5.54 us   +4.98%
SpecialClassAttribute:     1234.60 ms    2.06 us  +19.23%
SpecialInstanceAttribute:  1422.55 ms    2.37 us  +11.28%
StringMappings:            1585.55 ms   12.58 us  +11.46%
StringPredicates:          1241.35 ms    4.43 us   -9.69%
StringSlicing:             1206.20 ms    6.89 us  +15.12%
TryExcept:                 1764.35 ms    1.18 us   -0.81%
TryRaiseExcept:            1217.40 ms   81.16 us   +2.77%
TupleSlicing:               933.00 ms    8.89 us   -1.90%
UnicodeMappings:           1137.35 ms   63.19 us   -0.49%
UnicodePredicates:         1632.05 ms    7.25 us   +7.43%
UnicodeProperties:         1244.05 ms    6.22 us   +5.44%
UnicodeSlicing:            1252.10 ms    7.15 us   +9.27%
------------------------------------------------------------------------
Average round time:       58804.00 ms              +4.73%

*) measured against: /home/lemburg/tmp/pybench-2.0.pyb (rounds=10, warp=20)

-- Marc-Andre Lemburg
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
what results do you get on Windows ?
Win2k, dual 800, relatively quiet!

Python 2.0:

F:\src\Python-2.0\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.847605
This machine benchmarks at 11798 pystones/second
F:\src\Python-2.0\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.845104
This machine benchmarks at 11832.9 pystones/second
F:\src\Python-2.0\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.846069
This machine benchmarks at 11819.4 pystones/second
F:\src\Python-2.0\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.849447
This machine benchmarks at 11772.4 pystones/second

Python from CVS today:

F:\src\python-cvs\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.885801
This machine benchmarks at 11289.2 pystones/second
F:\src\python-cvs\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.889048
This machine benchmarks at 11248 pystones/second
F:\src\python-cvs\PCbuild>python ..\lib\test\pystone.py
Pystone(1.1) time for 10000 passes = 0.892422
This machine benchmarks at 11205.5 pystones/second

Although I deleted Tim's earlier mail, from memory this is pretty similar in terms of performance lost. I'm afraid I have no idea what your benchmarks are or how to build them <wink>, but I did check that the optimizer is set for "maximize speed" (/O2). Other compiler options gave significantly smaller results (no optimizations around 8500, and "optimize for space" (/O1) at around 10000). Other fiddling with the optimizer couldn't get better results than the existing settings.

Mark.
I hope another set of benchmarks isn't overkill for the list. I see different results comparing 2.1 with 2.0 (both -O3) using pybench 0.6. The interesting differences I see in this benchmark that I didn't see in MAL's are:

    DictCreation    +15.87%
    SecondImport    +20.29%

Other curious differences, which show up in both benchmarks, include:

    SpecialClassAttribute     +17.91%  (private variables)
    SpecialInstanceAttribute  +15.34%  (__methods__)

Jeremy

PYBENCH 0.6
Benchmark: py21 (rounds=10, warp=20)

Tests:                        per run    per oper.    diff *
------------------------------------------------------------------------
BuiltinFunctionCalls:       305.05 ms    2.39 us   +4.77%
BuiltinMethodLookup:        319.65 ms    0.61 us   +2.55%
ConcatStrings:              383.70 ms    2.56 us   +1.27%
CreateInstances:            463.85 ms   11.04 us   +1.96%
CreateStringsWithConcat:    381.20 ms    1.91 us   +2.39%
DictCreation:               508.85 ms    3.39 us  +15.87%
ForLoops:                   577.60 ms   57.76 us   +5.65%
IfThenElse:                 443.70 ms    0.66 us   +1.02%
ListSlicing:                207.50 ms   59.29 us   -4.18%
NestedForLoops:             315.75 ms    0.90 us   +3.54%
NormalClassAttribute:       379.80 ms    0.63 us   +7.39%
NormalInstanceAttribute:    385.45 ms    0.64 us   +8.04%
PythonFunctionCalls:        400.00 ms    2.42 us  +13.62%
PythonMethodCalls:          306.25 ms    4.08 us   +5.13%
Recursion:                  337.25 ms   26.98 us  +19.00%
SecondImport:               301.20 ms   12.05 us  +20.29%
SecondPackageImport:        298.20 ms   11.93 us  +18.15%
SecondSubmoduleImport:      339.15 ms   13.57 us  +11.40%
SimpleComplexArithmetic:    392.70 ms    1.79 us  -10.52%
SimpleDictManipulation:     350.40 ms    1.17 us   +3.87%
SimpleFloatArithmetic:      300.75 ms    0.55 us   +2.04%
SimpleIntFloatArithmetic:   347.95 ms    0.53 us   +9.01%
SimpleIntegerArithmetic:    356.40 ms    0.54 us  +12.01%
SimpleListManipulation:     351.85 ms    1.30 us  +11.33%
SimpleLongArithmetic:       309.00 ms    1.87 us   -5.81%
SmallLists:                 584.25 ms    2.29 us  +10.20%
SmallTuples:                442.00 ms    1.84 us  +10.33%
SpecialClassAttribute:      406.50 ms    0.68 us  +17.91%
SpecialInstanceAttribute:   557.40 ms    0.93 us  +15.34%
StringSlicing:              336.45 ms    1.92 us   +9.56%
TryExcept:                  650.60 ms    0.43 us   +1.40%
TryRaiseExcept:             345.95 ms   23.06 us   +2.70%
TupleSlicing:               266.35 ms    2.54 us   +4.70%
------------------------------------------------------------------------
Average round time:       14413.00 ms              +7.07%

*) measured against: py20 (rounds=10, warp=20)
"MAL" == M -A Lemburg <mal@lemburg.com> writes:
MAL> Yes... after a check of the Makefile I found that I had
MAL> compiled Python 2.0 with -O3 and 2.1a1 with -O2 -- perhaps this
MAL> makes a difference w/r to inlining of code. I'll recompile and
MAL> rerun the benchmark.

When I was working on the CALL_FUNCTION revision, I compared 2.0 final with my development version using -O3. At that time, I saw no significant performance difference between the two. And I did notice a difference between -O2 and -O3.

The strange thing is that I notice a difference between -O2 and -O3 with 2.1a1, but in the opposite direction. On pystone, a python compiled with -O2 runs consistently faster than one compiled with -O3; the difference is .05 sec on my machine.

Jeremy
Jeremy Hylton <jeremy@alum.mit.edu>:
The strange thing is that I notice a difference between -O2 and -O3 with 2.1a1, but in the opposite direction. On pystone, python -O2 runs consistently faster than -O3; the difference is .05 sec on my machine.
Bizarre. Makes me wonder if we have a C compiler problem.

-- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

In every country and in every age, the priest has been hostile to liberty. He is always in alliance with the despot, abetting his abuses in return for protection to his own. -- Thomas Jefferson, 1814
"ESR" == Eric S Raymond <esr@thyrsus.com> writes:
ESR> Jeremy Hylton <jeremy@alum.mit.edu>:
The strange thing is that I notice a difference between -O2 and -O3 with 2.1a1, but in the opposite direction. On pystone, python -O2 runs consistently faster than -O3; the difference is .05 sec on my machine.
ESR> Bizarre. Makes me wonder if we have a C compiler problem.

Depends on your definition of "compiler problem" <wink>. If you mean it compiles our code so it runs slower, then, yes, we've got one :-).

One of the differences between -O2 and -O3, according to the man page, is that -O3 will perform optimizations that involve a space-speed tradeoff. It also includes -finline-functions. I can imagine that some of these optimizations hurt memory performance enough to make a difference.

not-really-understanding-but-not-really-expecting-to-ly y'rs, Jeremy
On Mon, Jan 29, 2001 at 12:27:08PM -0500, Jeremy Hylton wrote:
Depends on your defintion of "compiler problem" <wink>. If you mean, it compiles our code so it runs slower, then, yes, we've got one :-).
Compiling with gcc and -g, with no optimization, 2.0 and 2.1cvs seem to be very close, with 2.1 slightly slower:

2.0:
Pystone(1.1) time for 10000 passes = 1.04
This machine benchmarks at 9615.38 pystones/second
This machine benchmarks at 9345.79 pystones/second
This machine benchmarks at 9433.96 pystones/second
This machine benchmarks at 9433.96 pystones/second
This machine benchmarks at 9523.81 pystones/second

2.1cvs:
Pystone(1.1) time for 10000 passes = 1.09
This machine benchmarks at 9174.31 pystones/second
This machine benchmarks at 9090.91 pystones/second
This machine benchmarks at 9259.26 pystones/second
This machine benchmarks at 9174.31 pystones/second
This machine benchmarks at 9090.91 pystones/second

Would it be worth experimenting with platform-specific compiler options to try to squeeze out the last bit of performance? (This can wait for the betas, probably.)

--amk
Note that optimizing compilers use a pile of linear-time heuristics to attempt to solve exponential-time optimization problems (from optimal register assignment to optimal instruction scheduling, they're all formally intractable even in isolation). When code gets non-trivial, not even a compiler's chief designer can reliably outguess what optimization may do. It's really not unusual for a higher optimization level to yield slower code, and especially not when the source code is pushing or exceeding machine limits (# of registers, # of instruction pipes, size of branch-prediction buffers; I-cache structure; dynamic restrictions on execution units; ...). [Jeremy]
... One of the differences between -O2 and -O3, according to the man page, is that -O3 will perform optimizations that involve a space-speed tradeoff. It also include -finline-functions. I can imagine that some of these optimizations hurt memory performance enough to make a difference.
One of the time-consuming ongoing tasks at my last employer was running profiles and using them to override counterproductive compiler inlining decisions (in both directions). It's not just memory that excessive inlining can screw up, but also things like running out of registers and so inserting gobs of register spill/restore code, and inlining so much code that the instruction scheduler effectively gives up (under many compilers, a sure sign of this is when you look at the generated code for a function, and it looks beautiful "at the top" but terrible "at the bottom"; some clever optimizers tried to get around that by optimizing "bottom-up", and then it looks beautiful at the bottom but terrible at the top <0.5 wink>; others work middle-out or burn the candle at both ends, with visible consequences you should be able to recognize now!). optimization-is-easier-than-speech-recog-but-the-latter-doesn't-work- all-that-well-either-ly y'rs - tim
The recursion test in pybench is testing the performance of the nested scopes changes, which must do some extra bookkeeping to reference the recursive function in a nested scope. To some extent, a performance hit is a necessary consequence for nested functions with free variables. Nonetheless, there are two interesting things to say about this situation.

First, there is a bug in the current implementation of nested scopes that the benchmark tickles. The problem is with code like this:

    def outer():
        global f
        def f(x):
            if x > 0:
                return f(x - 1)

The compiler determines that f is free in f. (It's recursive.) If f is free in f, in the absence of the global declaration, the body of outer must allocate fresh storage (a cell) for f each time outer is called and add a reference to that cell to f's closure. If f is declared global in outer, then it ought to be treated as a global in nested scopes, too. In general terms, a free variable should use the binding found in the nearest enclosing scope. If the nearest enclosing scope has a global binding, then the reference is global. If I fix this problem, the recursion benchmark shouldn't be any slower than a normal function call.

The second interesting thing to say is that frame allocation and dealloc is probably more expensive than it needs to be in the current implementation. The frame object has a new f_closure slot that holds a tuple that is freshly allocated every time the frame is allocated. (Unless the closure is empty; then f_closure is just NULL.) The extra tuple allocation can probably be done away with by using the same allocation strategy as locals & stack. If the f_localsplus array holds cells + frees + locals + stack, then a new frame will never require more than a single malloc (and often not even that).

Jeremy
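Modern CPython ended up with exactly the semantics Jeremy argues for: without a global declaration the recursive reference resolves through a closure cell, and with `global f` in the enclosing function the nested reference resolves through the module-level binding. A sketch checking both cases (the function names are illustrative):

```python
# Case 1: no global declaration -- f is a free variable in its own body,
# so the recursive call resolves through a closure cell in outer's frame.
def outer():
    def f(x):
        if x > 0:
            return f(x - 1)   # looked up via the cell
        return "cell"
    return f

print(outer()(3))   # -> cell

# Case 2: "global f" in the enclosing scope (the buggy case in 2.1a1) --
# the nested reference resolves through the module-level binding instead,
# so no cell is allocated for f.
def outer_global():
    global f
    def f(x):
        if x > 0:
            return f(x - 1)   # looked up via the global binding
        return "global"

outer_global()
print(f(3))   # -> global
```

In case 2 the recursive call is an ordinary global lookup, which is why fixing the bug makes the recursion benchmark no slower than a normal function call.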