
Python 1.6a2 is around 10% slower than 1.5 on pystone. Any idea why?

[amk@mira Python-1.6a2]$ ./python Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.59
This machine benchmarks at 2785.52 pystones/second

[amk@mira Python-1.6a2]$ python1.5 Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 3.19
This machine benchmarks at 3134.8 pystones/second

--amk

"A.M. Kuchling" wrote:
Hee hee :-)

D:\python>python Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.92135
This machine benchmarks at 5204.66 pystones/second

D:\python>cd \python16
D:\Python16>python Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 2.06234
This machine benchmarks at 4848.86 pystones/second

D:\Python16>cd \python\spc
D:\python\spc>python Lib/test/pystone.py
python: can't open file 'Lib/test/pystone.py'
D:\python\spc>python ../Lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.81034
This machine benchmarks at 5523.82 pystones/second

More hee hee :-) Python's main loop has been sitting right at a critical size, and the recently added code pushes it past that size. I saw the same effect with Stackless Python, and I have already worked around it there.

ciao - chris

-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com

Mark Hammond wrote:
My work-arounds originated in code from last January, when I was on a speed trip (with the usual low interest from Guido). Then, with Stackless, I saw a minor speed loss and finally concluded that it would be good to apply my patches to my Python version. That was nothing special so far, and Stackless was still a bit slow. For quite a long time I thought this came from the different way functions are called, until I finally found out this February: the central loop of the Python interpreter is at a critical size for caching. Speed depends very much on which code ends up near which other code, and on how big the whole interpreter loop is. What I did:

- Un-inlined several code pieces again, back into functions, in order to make the big switch smaller.
- Simplified error handling; in particular, I ensured that all local error variables have very short lifetimes and are optimized away.
- Simplified the big switch and turned the why_code handling into special opcodes, so the whole thing gets much simpler.

This reduces code size, and therefore increases the probability that we stay in the cache; and thanks to the short variable lifetimes and the simpler loop structure, the compiler seems to do a better job of code ordering.
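
[Editor's note: for concreteness, a minimal, hypothetical sketch of the first item above, moving an opcode body out of a big interpreter switch into a static helper so the switch itself stays small. The names (Frame, run_binary_add, eval_loop) are made up for illustration; the real ceval.c is far more involved.]

#include <stdio.h>

typedef struct {
    int *stack;
    int  top;
} Frame;

static void push(Frame *f, int v) { f->stack[f->top++] = v; }
static int  pop(Frame *f)         { return f->stack[--f->top]; }

/* The body that used to be spelled out inline inside the switch. */
static void run_binary_add(Frame *f)
{
    int right = pop(f);
    int left  = pop(f);
    push(f, left + right);
}

static void eval_loop(Frame *f, const unsigned char *code, int ncode)
{
    for (int i = 0; i < ncode; i++) {
        switch (code[i]) {
        case 1:                  /* a BINARY_ADD-like opcode */
            run_binary_add(f);   /* one call instead of inlined code,  */
            break;               /* so the switch body stays small     */
        default:
            break;
        }
    }
}

int main(void)
{
    int storage[16];
    Frame f = { storage, 0 };
    const unsigned char code[] = { 1 };

    push(&f, 2);
    push(&f, 3);
    eval_loop(&f, code, 1);
    printf("%d\n", pop(&f));     /* prints 5 */
    return 0;
}
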
Only-2-more-years-of-beating-up-Guido-before-stackless-time-ly,
Yup, and until then I will not apply my patches to Python; this is part of my license: use it, but only *with* Stackless.

ciao - chris

-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com

"A.M. Kuchling" wrote:
Python 1.6a2 is around 10% slower than 1.5 on pystone. Any idea why?
I submitted a comparison with Stackless Python. Now I have actually applied the Stackless Python patches to the current CVS version. My version again shows up as faster than standard Python, with the same relative measures, but I see the same effect: Stackless 1.5.2+ is 10 percent faster than Stackless 1.6a2.

Claim: this is not related to ceval.c. Something else must have introduced a significant speed loss.

Stackless Python, on the pre-unicode tag version of CVS:

D:\python\spc>python ../lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.80724
This machine benchmarks at 5533.29 pystones/second

Stackless Python, on the recent version of CVS:

D:\python\spc\Python-cvs\PCbuild>python ../lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.94941
This machine benchmarks at 5129.75 pystones/second

Less than 10 percent, but bad enough. I guess we have to use MAL's test suite and measure everything in isolation.

ciao - chris

-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com

Ack, sorry. Please drop the last message. This one was done with the correct dictionaries. :-(

Christian Tismer wrote:
this one corrected:

D:\python\spc\Python-slp\PCbuild>python ../lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.98433
This machine benchmarks at 5039.49 pystones/second
Less than 10 percent, but bad enough.
It is 10 percent, and bad enough.
-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com

I think I can now explain what's happening, at least on the Windows platform.

Python 1.5.2's .dll was close to 512K, a bit more. I seem to remember that 512K is a common size for the secondary cache. Now, linking with the MS linker does not give you any particularly useful ordering of modules; when I look into the map file, the modules appear sorted by name, which is surely not optimal for performance. As I read the docs, explicit ordering at link time would only make sense for C++ and wouldn't work out for C, since we could order the exported functions but not the private ones, putting even more distance between related code.

To see whether I might be right, I ripped out almost all of the builtin extension modules and compiled/linked without them. That shrank the dll from 647K down to 557K, very close to the 1.5.2 size. Now I get the following figures:

Python 1.6, with stackless patches:

D:\python\spc\Python-slp\PCbuild>python /python/lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.95468
This machine benchmarks at 5115.92 pystones/second

Python 1.6, from the dist:

D:\Python16>python /python/lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 2.09214
This machine benchmarks at 4779.8 pystones/second

That means my optimizations take effect again once the overall code size drops below about 512K. I think these 10 percent are quite valuable. These options come to mind:

a) Try to achieve an optimal code ordering in the too-large .dll. This seems hard to do.
b) Split the dll into two dlls, in such a way that all the necessary internal stuff sits close together in one of them.
c) Split the library as above, but use a static library layout for one part and link that static library into the final dll. This would hopefully keep related things together.

I don't know if c) is possible, but it might be worth trying. Any thoughts?

ciao - chris

-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com

Sorry, it was not really found...

Christian Tismer wrote:
[thought he had found the speed leak]

After re-inserting all the builtin modules, I got nearly the same result after a complete rebuild, just marginally slower. Something else must be happening that I cannot understand. Stackless Python on top of 1.5.2+ is still nearly 10 percent faster, regardless of what I do to Python 1.6.

Testing whether Unicode has some effect, I changed PyUnicode_Check to always return 0. This should optimize most of the related code away. Result: no change at all!

Which changes made after the pre-unicode tag might really matter for performance? I'm quite desperate; any ideas?

ciao - chris

-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com
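
[Editor's note: a minimal sketch of the kind of experiment described above, under the assumption that the check is forced to a compile-time constant 0 so the optimizer can discard the unicode-only branches as dead code. The macro and function names below are stand-ins, not the actual 1.6 headers.]

#include <stdio.h>
#include <string.h>

#define FAKE_UNICODE_CHECK(op) 0    /* stand-in for PyUnicode_Check forced to 0 */

/* A string operation with a unicode fast path that is now dead code. */
static size_t combined_length(const char *a, const char *b)
{
    if (FAKE_UNICODE_CHECK(a) || FAKE_UNICODE_CHECK(b)) {
        /* unicode path: unreachable, eliminated by the compiler */
        return 0;
    }
    return strlen(a) + strlen(b);
}

int main(void)
{
    printf("%zu\n", combined_length("spam", "eggs"));   /* prints 8 */
    return 0;
}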

The performance difference I see on my Sparc is smaller. The machine is a 200MHz Ultra Sparc 2 with 256MB of RAM, and I built both versions with GCC 2.8.1. There, 1.6a2 is about 3.3% slower. The median pystone times, taken from 10 measurements each, are:

1.5.2  4.87
1.6a2  5.035

For comparison, the numbers I see on my Linux box (dual PII 266) are:

1.5.2  3.18
1.6a2  3.53

That's about 10% faster under 1.5.2. I'm not sure how important this change is. Three percent isn't enough for me to worry about, but that's on a minority platform; I suppose 10 percent is right on the cusp. If the performance difference is the cost of the many improvements in 1.6, I think it's worth the price.

Jeremy
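
[Editor's note: a tiny sketch of the measurement style described above: take several pystone times and report the median, which for an even sample count is the mean of the two middle values (hence figures like 5.035). The sample times below are made up for illustration.]

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n benchmark times; for even n, mean of the two middle values. */
static double median(double *samples, size_t n)
{
    qsort(samples, n, sizeof *samples, cmp_double);
    if (n % 2 == 1)
        return samples[n / 2];
    return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

int main(void)
{
    /* Made-up times from 10 pystone runs, in seconds. */
    double times[] = { 5.01, 5.06, 5.03, 5.10, 4.98,
                       5.04, 5.02, 5.07, 5.05, 5.00 };
    printf("median: %.3f\n", median(times, 10));   /* prints 5.035 */
    return 0;
}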

Jeremy Hylton wrote:
Which GCC was it on the Linux box, and how much RAM does it have?
Yes, and I'm happy to pay the price if I can see where I pay. That's the problem: the changes between the pre-unicode tag and the current CVS are not enough to justify that speed loss. There must be something substantial. I also don't grasp why my optimizations are so much more effective on 1.5.2+ than on 1.6. Mark Hammond pointed me to the int/long unification. Was this done *after* the unicode patches?

ciao - chris

-- Christian Tismer <tismer@appliedbiometrics.com>, http://www.stackless.com

Christian Tismer writes:
Mark Hammond pointed me to the int/long unification. Was this done *after* the unicode patches?
Before. It seems unlikely they're the cause; they just add an 'if (PyLong_Check(key))' branch to the slicing functions in abstract.c. OTOH, if pystone really exercises sequence multiplication, maybe they're related (but 10% worth?).

-- A.M. Kuchling  http://starship.python.net/crew/amk/
I know flattery when I hear it; but I do not often hear it.
  -- Robertson Davies, _Fifth Business_
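
[Editor's note: for illustration, a hedged sketch of the shape of the branch described above, accepting a Python long as an index where previously only a plain int was handled. This is not the literal abstract.c source; the helper name is made up.]

#include "Python.h"

static long
get_index_value(PyObject *key, int *ok)
{
    if (PyInt_Check(key)) {             /* plain int: the old fast path */
        *ok = 1;
        return PyInt_AsLong(key);
    }
    if (PyLong_Check(key)) {            /* the newly added branch */
        long v = PyLong_AsLong(key);    /* may set OverflowError */
        *ok = (v != -1 || !PyErr_Occurred());
        return v;
    }
    *ok = 0;
    return -1;
}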

"A.M. Kuchling" wrote:
Hee hee :-) D:\python>python Lib/test/pystone.py Pystone(1.1) time for 10000 passes = 1.92135 This machine benchmarks at 5204.66 pystones/second D:\python>cd \python16 D:\Python16>python Lib/test/pystone.py Pystone(1.1) time for 10000 passes = 2.06234 This machine benchmarks at 4848.86 pystones/second D:\Python16>cd \python\spc D:\python\spc>python Lib/test/pystone.py python: can't open file 'Lib/test/pystone.py' D:\python\spc>python ../Lib/test/pystone.py Pystone(1.1) time for 10000 passes = 1.81034 This machine benchmarks at 5523.82 pystones/second More hee hee :-) Python has been at a critical size with its main loop. The recently added extra code exceeds this size. I had the same effect with Stackless Python, and I worked around it already. ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

Mark Hammond wrote:
My work-arounds originated from code from last January where I was on a speed trip, but with the (usual) low interest from Guido. Then, with Stackless I saw a minor speed loss and finally came to the conclusion that I would be good to apply my patches to my Python version. That was nothing special so far, and Stackless was still a bit slow. I though this came from the different way to call functions for quite a long time, until I finally found out this February: The central loop of the Python interpreter is at a critical size for caching. Speed depends very much on which code gets near which other code, and how big the whole interpreter loop is. What I did: - Un-inlined several code pieces again, back into functions in order to make the big switch smaller. - simplified error handling, especially I ensured that all local error variables have very short lifetime and are optimized away - simplified the big switch, tuned the why_code handling into special opcodes, therefore the whole things gets much simpler. This reduces code size and therefore the probability that we are in the cache, and due to short variable lifetime and a simpler loop structure, the compiler seems to do a better job of code ordering.
Only-2-more-years-of-beating-up-Guido-before-stackless-time-ly,
Yup, and until then I will not apply my patches to Python, this is part of my license: Use it but only *with* Stackless. ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

"A.M. Kuchling" wrote:
Python 1.6a2 is around 10% slower than 1.5 on pystone. Any idea why?
I submitted a comparison with Stackless Python. Now I actually applied the Stackless Python patches to the current CVS version. My version does again show up as faster than standard Python, with the same relative measures, but I too have this effect: Stackless 1.5.2+ is 10 percent faster than Stackless 1.6a2. Claim: This is not related to ceval.c . Something else must have introduced a significant speed loss. Stackless Python, upon the pre-unicode tag version of CVS: D:\python\spc>python ../lib/test/pystone.py Pystone(1.1) time for 10000 passes = 1.80724 This machine benchmarks at 5533.29 pystones/second Stackless Python, upon the recent version of CVS: D:\python\spc\Python-cvs\PCbuild>python ../lib/test/pystone.py Pystone(1.1) time for 10000 passes = 1.94941 This machine benchmarks at 5129.75 pystones/second Less than 10 percent, but bad enough. I guess we have to use MAL's test suite and measure everything alone. ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

Ack, sorry. Please drop the last message. This one was done with the correct dictionaries. :-() Christian Tismer wrote:
this one corrected: D:\python\spc\Python-slp\PCbuild>python ../lib/test/pystone.py Pystone(1.1) time for 10000 passes = 1.98433 This machine benchmarks at 5039.49 pystones/second
Less than 10 percent, but bad enough.
It is 10 percent, and bad enough.
-- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

I guess I can explain now what's happening, at least for the Windows platform. Python 1.5.2's .dll was nearly about 512K, something more. I think to remember that 512K is a common size of the secondary cache. Now, linking with the MS linker does not give you any particularly useful order of modules. When I look into the map file, the modules appear sorted by name. This is for sure not providing optimum performance. As I read the docs, explicit ordering of the linkage would only make sense for C++ and wouldn't work out for C, since we could order the exported functions, but not the private ones, giving even more distance between releated code. My solution to see if I might be right was this: I ripped out almost all builtin extension modules and compiled/linked without them. This shrunk the dll size down from 647K to 557K, very close to the 1.5.2 size. Now I get the following figures: Python 1.6, with stackless patches: D:\python\spc\Python-slp\PCbuild>python /python/lib/test/pystone.py Pystone(1.1) time for 10000 passes = 1.95468 This machine benchmarks at 5115.92 pystones/second Python 1.6, from the dist: D:\Python16>python /python/lib/test/pystone.py Pystone(1.1) time for 10000 passes = 2.09214 This machine benchmarks at 4779.8 pystones/second That means my optimizations are in charge again, after the overall code size went below about 512K. I think these 10 percent are quite valuable. These options come to my mind: a) try to do optimum code ordering in the too large .dll . This seems to be hard to achieve. b) Split the dll into two dll's in a way that all the necessary internal stuff sits closely in one of them. c) try to split the library like above, but use a static library layout for one of them, and link the static library into the final dll. This would hopefully keep related things together. I don't know if c) is possible, but it might be tried. Any thoughts? ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

Sorry, it was not really found... Christian Tismer wrote: [thought he had found the speed leak] After re-inserting all the builtin modules, I got nearly the same result after a complete re-build, just marginally slower. There must something else be happening that I cannot understand. Stackless Python upon 1.5.2+ is still nearly 10 percent faster, regardless what I do to Python 1.6. Testing whether Unicode has some effect? I changed PyUnicode_Check to always return 0. This should optimize most related stuff away. Result: No change at all! Which changes were done after the pre-unicode tag, which might really count for performance? I'm quite desperate, any ideas? ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

The performance difference I see on my Sparc is smaller. The machine is a 200MHz Ultra Sparc 2 with 256MB of RAM, built both versions with GCC 2.8.1. It appears that 1.6a2 is about 3.3% slower. The median pystone time taken from 10 measurements are: 1.5.2 4.87 1.6a2 5.035 For comparison, the numbers I see on my Linux box (dual PII 266) are: 1.5.2 3.18 1.6a2 3.53 That's about 10% faster under 1.5.2. I'm not sure how important this change is. Three percent isn't enough for me to worry about, but it's a minority platform. I suppose 10 percent is right on the cusp. If the performance difference is the cost of the many improvements of 1.6, I think it's worth the price. Jeremy

Jeremy Hylton wrote:
Which GCC was it on the Linux box, and how much RAM does it have?
Yes, and I'm happy to pay the price if I can see where I pay. That's the problem, the changes between the pre-unicode tag and the current CVS are not enough to justify that speed loss. There must be something substantial. I also don't grasp why my optimizations are so much more powerful on 1.5.2+ as on 1.6 . Mark Hammond pointed me to the int/long unification. Was this done *after* the unicode patches? ciao - chris -- Christian Tismer :^) <mailto:tismer@appliedbiometrics.com> Applied Biometrics GmbH : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

Christian Tismer writes:
Mark Hammond pointed me to the int/long unification. Was this done *after* the unicode patches?
Before. It seems unlikely they're the cause (they just add a 'if (PyLong_Check(key)' branch to the slicing functions in abstract.c. OTOH, if pystone really exercises sequence multiplication, maybe they're related (but 10% worth?). -- A.M. Kuchling http://starship.python.net/crew/amk/ I know flattery when I hear it; but I do not often hear it. -- Robertson Davies, _Fifth Business_
participants (6)
- A.M. Kuchling
- Andrew M. Kuchling
- Christian Tismer
- Christian Tismer
- Jeremy Hylton
- Mark Hammond