Python 3.10 vs 3.8 performance degradation
Hello,

Being a programmer myself, I realise that a report on a performance degradation should ideally contain a small test program that clearly reproduces the problem. Unfortunately, I do not have the time at present to isolate the issue to a small test case. But the good news (or bad news, I suppose) is that the problem appears to be reasonably general: it shows up in two completely different programs.

What I am claiming is that Python 3.10 is between 1.5 and 2.5 times SLOWER than Python 3.8 for rather generic scientific calculations such as Fourier analysis, ODE solving and plotting. The first "test case" is a rather complex program that calculates the Wigner function of a quantum system: it takes 9 seconds when run with 3.8 and 23 seconds when run with 3.10. It is very easy to reproduce: clone this repository: https://github.com/tigran123/quantum-infodynamics and run "time bin/harmonic-oscillator-solve.sh" from the dynamics subdirectory, then edit initgauss.py and solve.py to point to python3.10 and run it again. Make sure your TMPDIR points somewhere fast. My machine is a very fast 6-core i7-6800K at 4.2GHz with 128GB RAM. The storage is also a very fast NVMe, about 3GB/s.

After this, try a completely different program, which simulates a mathematical pendulum using PyQt (GUI): it gives an FPS of 14-15 when run with 3.8 and only 11-12 when run with 3.10. Again, it is easy to reproduce if you have cloned the above repository: go to the classical-mechanics/pendulum subdirectory and run psim.py (click the Play button in the control window and observe the FPS in the plot window). Then edit psim.py to point to Python 3.10 and run it again. You would need PyQt5, matplotlib, numpy, scipy and pyFFTW for these programs to work.

I realise that you would much prefer a small, specific test case, but I still hope that this report is "better than nothing". I really do want to help improve Python and will provide more information if requested. I use Python everywhere, even in Termux on Android, and am quite saddened by this degradation...

With Python 3.8 I used these package versions:
matplotlib 3.1.3, numpy 1.18.1, pyFFTW 0.12.0, PyQt5 5.13.2, scipy 1.4.1

With Python 3.10 I used these package versions:
matplotlib 3.5.0, numpy 1.21.4, pyFFTW 0.12.0, PyQt5 5.15.6, scipy 1.7.3

Both Python 3.8 and 3.10 were compiled and installed by myself with "./configure --enable-optimizations ; make ; sudo make install".

Kind regards,
Tigran
My guess is that this difference is predominantly down to different builds of NumPy. For example, the Intel-optimized builds are very good, and a speed difference of the magnitude shown in this note is typical. E.g.
https://www.intel.com/content/www/us/en/developer/articles/technical/numpysc...
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/34FP7FEG...
Code of Conduct: http://python.org/psf/codeofconduct/
To eliminate the possibility of being affected by the different versions of numpy, I have just now upgraded numpy in the Python 3.8 environment to the latest version, so both 3.8 and 3.10 are using numpy 1.21.4 -- and still the timing is exactly the same.
"Exactly the same" between Python versions, or exactly the same as previously reported? On Sun, 2021-12-19 at 18:48 +0000, Tigran Aivazian wrote:
To eliminate the possibility of being affected by the different versions of numpy I have just now upgraded numpy in Python 3.8 environment to the latest version, so both 3.8 and 3.10 and using numpy 1.21.4 and still the timing is exactly the same. _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/THPN4OWM... Code of Conduct: http://python.org/psf/codeofconduct/
Alas, it is exactly the same as previously reported, so the problem persists. If it were exactly the same between Python versions I would celebrate and shout for joy, seeing the problem narrowed down to numpy. I can carefully upgrade all the other packages in 3.8 to match those in 3.10. As I can downgrade (I will test that first), I should be able to restore my "superfast 3.8 environment" should this upgrade break it. I will report what I discover.
Not the version, but the build. Did you compile NumPy from source using the same compiler with both Python versions? If not, that remains my strong hunch about the performance difference.

Given what your programs do, it sure seems like the large majority of the runtime is spent in the supporting numeric libraries, not in the Python interpreter itself.

Profiling is the way to find out.
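A minimal profiling sketch, for what it's worth -- the `solve` function here is only a stand-in for whatever actually dominates the real script:

```python
import cProfile
import io
import pstats

def solve():
    # Stand-in for the real workload: some pure-Python numeric busywork.
    total = 0.0
    for i in range(1, 200_000):
        total += 1.0 / i
    return total

profiler = cProfile.Profile()
profiler.enable()
solve()
profiler.disable()

# Print the 10 most expensive functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Run the same thing under both interpreters and compare which entries grow.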
In both cases I installed numpy using "sudo -H pip install numpy". And just now I upgraded numpy in 3.8 using "sudo -H pip3.8 install --upgrade numpy". I will try to simplify the program by removing all the higher level complexity and see what I find.
These are binary wheel installs though, no? Which means the 3.8 version and the 3.10 version were compiled at different times, even for the same NumPy version -- and possibly for different platforms; I don't know which you are on, and I haven't checked what's on PyPI for each version. I think pyFFTW is largely using NumPy. You can find details with something like:

    import numpy.distutils
    numpy.distutils.unixccompiler.sysconfig.get_config_vars()

I suspect that will indicate interesting compiler differences even for the "same version."
As Chris Barker mentions, this will probably find people more familiar with
the issue on the NumPy mailing list.
I think I have found something very interesting. Namely, I removed all multiprocessing (which is done in the shell script, not in Python) and so reduced the program to just a single thread of execution. And lo and behold, Python 3.10 now consistently beats 3.8 by about 5%.

However, this is not the END! It is very important to find out why, when running multiple processes simultaneously, 3.8 still outperforms 3.10. The thing is, all these different threads write to completely unrelated data files (.npz and .npy). The only thing they all have in common is the initial data, which they all read from the same 'init.npz' and 'init_W.npy' files using:

    with load(args.ifilename + '.npz', allow_pickle=True) as data:

and

    Winit = memmap(iWfilename, dtype='float64', mode='r', shape=(Nt, Nx, Np))

So, could this be the problem?
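For reference, a self-contained sketch of that read pattern with throwaway stand-in files (the file names, shapes and contents here are made up; the real init files hold the solver's state). Read-only memory maps share pages between readers rather than duplicating the array per process, so concurrent readers should not contend on them:

```python
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()
npz_path = os.path.join(tmpdir, "init.npz")
raw_path = os.path.join(tmpdir, "init_W.dat")
Nt, Nx, Npts = 4, 8, 8  # stand-in grid sizes

# Write stand-in initial data: a pickled .npz archive and a raw memmap file.
np.savez(npz_path, x=np.linspace(0.0, 1.0, 100))
w = np.memmap(raw_path, dtype="float64", mode="w+", shape=(Nt, Nx, Npts))
w[:] = 0.0
w.flush()
del w

# The same read pattern as in solve.py: load the pickled archive...
with np.load(npz_path, allow_pickle=True) as data:
    x = data["x"].copy()

# ...and map the big initial array read-only, sharing pages between readers.
Winit = np.memmap(raw_path, dtype="float64", mode="r", shape=(Nt, Nx, Npts))
print(x.shape, Winit.shape)
```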
I have created four different sets of initial data, one for each thread of execution and no, unfortunately, that does NOT solve the problem. Still, when four threads are executed in parallel, 3.8 outperforms 3.10 by a factor of 2.4. So, there is some other point of contention between the threads, which I need to find...
So far I have narrowed it down to a block of code in solve.py doing a lot of multi-threaded FFT (i.e. with fft(..., threads=6) of pyFFTW), as well as numpy exp() and other functions, and heavy pure-Python list manipulation (yes, lists, not numpy arrays). All of this together (or some one part of it, yet to be discovered) is behaving as if some global lock were taken behind the scenes (i.e. inside the Python interpreter), so that when multiple instances of the script (which I loosely called "threads" in previous posts, but here I correct myself, as the word "threads" is more appropriately used in the context of FFT in this message) are executed in parallel, they slow each other down in 3.10, but not in 3.8.

So this is definitely a very interesting 3.10 degradation problem. I will try to investigate some more tomorrow...
I have got it narrowed down to the "threads=6" argument of the fft() and ifft() functions of pyFFTW! Namely, if I do NOT pass "threads=6" to fft()/ifft(), then the parallel execution of multiple instances of the scripts performs the same in Python 3.8 and 3.10. But it is a bit slower than with "threads=6", of course (my "multiprocessing" at the shell-script level is tied to the number of physical problems being solved simultaneously, which is small -- say 4 -- whereas I have 12 logical processors (6 physical cores) which could execute code in parallel).

So, this is where we are right now: pyFFTW 0.12.0 on Python 3.8 with threads=6 is 2.4 times faster than the same pyFFTW 0.12.0 on Python 3.10 when four scripts are executed in parallel. Removing "threads=6" makes 3.10 much faster and 3.8 a bit slower -- instead of 9 vs 23 seconds I get 11.2 (Python 3.8) vs 10.8 (Python 3.10) seconds, so Python 3.10 is even a little bit faster than 3.8, but still not as fast as with threads=6 on 3.8.

However, that pendulum PyQt GUI application does NOT do any Fourier transforms! So the problem with the FPS in the pendulum plotting is something different.
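For anyone wanting to poke at this without the full repository, a rough single-process timing sketch. It uses numpy.fft as a stand-in, since pyFFTW may not be installed everywhere; in the real script the calls go through pyFFTW's numpy-like interface, which additionally accepts a threads= keyword that numpy.fft does not:

```python
import time
import numpy as np

# Stand-in workload: repeated 1-D complex FFTs, loosely mimicking the
# FFT-heavy block in solve.py. Sizes and repeat count are arbitrary.
rng = np.random.default_rng(0)
a = rng.standard_normal(2**18) + 1j * rng.standard_normal(2**18)

start = time.perf_counter()
for _ in range(20):
    spectrum = np.fft.fft(a)
elapsed = time.perf_counter() - start
print(f"20 FFTs of {a.size} points: {elapsed:.3f}s")

# Round-trip check that the transform is sane.
roundtrip = np.fft.ifft(spectrum)
print("max round-trip error:", float(np.max(np.abs(roundtrip - a))))
```

Running this (or the pyFFTW equivalent, with and without threads=) under each interpreter, alone and with several copies in parallel, would separate the interpreter from the FFT library.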
"is behaving as if there was some global lock taken behind the scene (i.e. inside Python interpreter)"? The Python interpreter does have the GIL (Global Interpreter Lock). It can't execute Python bytecodes in parallel, but timeshares between the threads. The GIL is released during I/O and by some extensions while they're processing, but when they want to return, or if they want to use the Python API, they need to acquire the GIL again. The only way to get true parallelism in CPython is to use multiprocessing, where it's running in multiple processes.
Sure. But what the OP seems to have discovered is that there is some difference in behavior between 3.8 and 3.10 -- and AFAIK, there are no intended major changes in the GIL between those two releases.

I *think* that all of the issues have involved numpy (pyFFTW depends on numpy, and certainly matplotlib does) -- but I think the OP has made sure that the numpy (and other libs) versions are all the same. There still remains to confirm that numpy (and the other libs) are built exactly the same way in the 3.8 and 3.10 environments -- this can be a very complicated stack! But it seems either CPython itself, or numpy (or Cython?), is doing something different. Still to be discovered what that is.

Note to the OP: make sure that it's not as simple as a change to the default for the threads parameter.

Note 2: even if this is a regression in CPython itself, I suspect the numpy list may be a better way to get it figured out.

-CHB

--
Christopher Barker, PhD (Chris)
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
NumPy is very unlikely to have gotten slower. Please, please time your script before jumping to conclusions. For example, 2/3 of the time of that pendulum plotter is spent in plotting, and most of that seems to be spent in text rendering. (Yeah, there is a little bit of time in NumPy's `arr.take()` also, but I doubt that has anything to do with this.) Now, I don't know what does the text rendering, but maybe that got slower.

Cheers,
Sebastian
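A rough way to do that kind of coarse timing without a full profiler is a tiny context-manager timer around each section of the script (the labels and workloads below are placeholders):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Coarse per-section timer: much cheaper than a full profile, but
    # enough to see whether compute or plotting dominates a run.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f}s")

with timed("compute"):
    values = [k ** 0.5 for k in range(100_000)]

with timed("postprocess"):
    total = sum(values)

print(f"total = {total:.1f}")
```

Wrapping the compute loop and the draw calls separately, under both interpreters, would show which section accounts for the FPS drop.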
Just FYI (if you didn't already know), there is long-term tracking of performance benchmarks which you can see reflected at https://speed.python.org. The intent is that things not come as a surprise, so if there indeed turns out to be a surprise underneath your issue - and we all know benchmarking of complex workflows is quite tricky - maybe there's a new check that will want to be added there.
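One way to turn such a surprise into something comparable across interpreters is to distill the hot code into a stdlib timeit micro-benchmark that runs unchanged under both (the statement below is an arbitrary stand-in, not taken from the OP's programs):

```python
import sys
import timeit

# A micro-benchmark you can run unchanged under python3.8 and python3.10
# to compare pure-interpreter work in isolation from NumPy and friends.
setup = "data = list(range(1000))"
stmt = "sorted(x * x for x in data)"

# Take the best of several repeats to reduce scheduling noise.
best = min(timeit.repeat(stmt, setup=setup, repeat=5, number=1000))
print(f"{sys.version_info.major}.{sys.version_info.minor}: {best:.4f}s per 1000 runs")
```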
participants (8)
- aivazian.tigran@gmail.com
- Christopher Barker
- David Mertz, Ph.D.
- Mats Wichmann
- MRAB
- Paul Bryan
- Sebastian Berg
- Tigran Aivazian