Updated 'High Performance Python' tutorial (the one from EuroPython 2011)
Dear all, I've published v0.2 of my High Performance Python tutorial write-up from the session I ran at EuroPython: http://ianozsvald.com/2011/07/25/high-performance-python-tutorial-v0-2-from-...

Antonio - you asked earlier whether the 'expanded math' version of the Mandelbrot solver (using doubles rather than complex numbers) would be faster. I've timed it, and it is a bit faster with a nightly build of PyPy, but nowhere near as fast as ShedSkin's generated C output (details below).

Maciej - thanks for pointing me at the numpy module. I've added a tiny section showing numpy in PyPy, but I haven't converted the Mandelbrot solver to use it (even finishing v0.2 took longer than I'd expected). I'm hoping that some more exposure in the report might bring in more volunteers from outside.

Here's a clip from the report in the PyPy section: "Running pypy pure_python.py 1000 1000 on my MacBook takes 5.9 seconds; running pypy pure_python_2.py 1000 1000 takes 4.9 seconds. (Ian - the only difference in pure_python_2.py is that local dereferences in the tight loop are moved outside the loop, so fewer dereference operations are performed.)

As an additional test (not shown in the graphs) I ran pypy shedskin2.py 1000 1000, which runs the expanded-math version of the shedskin variant below (this replaces complex numbers with floats and expands abs to avoid the square root). The shedskin2.py run takes 3.2 seconds (still much slower than the 0.4s version compiled using shedskin)."

The pure_python src is here: https://github.com/ianozsvald/EuroPython2011_HighPerformanceComputing/tree/m...

shedskin2.py is available here: https://github.com/ianozsvald/EuroPython2011_HighPerformanceComputing/tree/m...

I haven't tested whether the warm-up periods for PyPy are significant; possibly they account for much of the difference between ShedSkin and PyPy? I want to revisit this, but for the next few weeks I have to go back to other projects.
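[Editor's note: to make the 'expanded math' idea above concrete, here is a minimal sketch. This is an illustration with made-up function names, not the shedskin2.py source. abs(z) > 2.0 hides a square root, so the escape test can instead compare the squared magnitude against 4.0, and the complex arithmetic can be expanded into plain float operations.]

```python
# Minimal sketch of the 'expanded math' idea (illustrative names, not the
# tutorial's actual code): replace complex arithmetic and abs() with plain
# float operations and a squared-magnitude test that avoids the square root.

def escape_iter_complex(c, maxiter=1000):
    """Escape-time loop using Python's complex type and abs()."""
    z = 0j
    for i in range(maxiter):
        z = z * z + c
        if abs(z) > 2.0:  # abs() computes a square root each iteration
            return i
    return maxiter

def escape_iter_floats(cr, ci, maxiter=1000):
    """Same computation expanded to floats; compares |z|^2 against 4.0."""
    zr = zi = 0.0
    for i in range(maxiter):
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        if zr * zr + zi * zi > 4.0:  # same test, no square root needed
            return i
    return maxiter
```

Both versions return the same iteration counts; the point of the transformation is to hand the JIT or C compiler simpler operations to optimise.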
I hope the report brings in some new folk for PyPy, Ian. -- Ian Ozsvald (A.I. researcher, screencaster) ian@IanOzsvald.com http://IanOzsvald.com http://SocialTiesApp.com/ http://MorConsulting.com/ http://blog.AICookbook.com/ http://TheScreencastingHandbook.com http://FivePoundApp.com/ http://twitter.com/IanOzsvald
On Mon, Jul 25, 2011 at 11:00 AM, Ian Ozsvald <ian@ianozsvald.com> wrote:
I haven't tested whether the warm-up periods for PyPy are significant, possibly they account for much of the difference between ShedSkin and PyPy? I want to revisit this but for the next few weeks I have to go back to other projects.
Most of it comes from the fact that you're using lists and not, say, array.array (or a numpy array), so the storage is not optimized. ShedSkin doesn't allow you to store different types in a list. We'll make it fast one day even if you use a list, but indeed, using array.array would make it much faster. Cheers, fijal
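[Editor's note: a small illustration of Maciej's point, not code from the thread. An array.array stores unboxed machine values in a contiguous buffer and enforces a single element type, whereas a list stores pointers to boxed Python objects and may mix types freely.]

```python
# Illustration (not from the thread): typed, contiguous storage with
# array.array versus a heterogeneous Python list.
from array import array

output = array('d', [0.0] * 5)  # typecode 'd' = C double, homogeneous buffer
output[2] = 3.14                # stores the raw double, no per-element boxing
# output[3] = "spam"            # would raise TypeError: the array is typed

boxed = [0.0] * 5
boxed[3] = "spam"               # a list happily mixes types
```

The fixed element type is exactly what lets a JIT or compiler use unboxed loads and stores on the data.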
Ah! Ok, I've noted array.array (along with running bigger tests to check for JIT warm-up overheads). Hopefully I'll get some more time in a few weeks to play with variants. Cheers! i.
Hello Ian, On 25/07/11 11:00, Ian Ozsvald wrote:
Dear all, I've published v0.2 of my High Performance Python tutorial write-up from the session I ran at EuroPython: http://ianozsvald.com/2011/07/25/high-performance-python-tutorial-v0-2-from-...
Today Armin and I investigated a bit further the performance of the Mandelbrot algorithm that you wrote for your tutorial. What we found is very interesting :-).

We compared three versions of the code:

- a (slightly modified) pure Python one on PyPy
- the Cython one, using calculate_z.pyx_2_bettermath
- the ShedSkin one, using shedskin2.py

The PyPy version looks like this:

def calculate_z_serial_purepython(q, maxiter, z):
    """Pure python with complex datatype, iterating over list of q and z"""
    output = [0] * len(q)
    for i in range(len(q)):
        zi = z[i]
        qi = q[i]
        for iteration in range(maxiter):
            zi = zi * zi + qi
            if (zi.real*zi.real + zi.imag*zi.imag) > 4.0:
                output[i] = iteration
                break
    return output

i.e., it is exactly the same as pure_python_2.py, but we avoid using abs(zi), so it is comparable with the Cython and ShedSkin versions.

First, we ran the programs passing "1000 1000" as arguments, and these are the results:

PyPy: 1.95 secs
Cython: 0.58 secs
ShedSkin: 0.42 secs

so PyPy is ~4.5x slower than ShedSkin.

However, we realized that with the default values of x1, x2, y1, y2 the innermost loop runs very few iterations most of the time, and this is one of the cases in which PyPy suffers most, because it needs to go through a bridge to continue the execution, and at the moment bridges are slower than loops.

So, we changed the values of x1, x2, y1, y2 to compute a different region, in which the innermost loop runs for more iterations. We used these values:

x1, x2, y1, y2 = 0.37865401-0.02, 0.37865401+0.02, 0.669227668-0.02, 0.669227668+0.02

and since all the programs compute this image faster, we passed "3000 3000" as arguments on the command line. These are the results:

PyPy: 0.89
Cython: 1.76
ShedSkin: 0.26

So, in this case, PyPy is ~2x faster than Cython and ~3.5x slower than ShedSkin.

In the meantime, Armin wrote a C version of it: http://paste.pocoo.org/raw/504216/ which took 0.946 seconds to complete.
This is in line with the PyPy result, but we are still investigating why the ShedSkin version is so much faster. ciao, Anto
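[Editor's note: Antonio's bridge observation can be made concrete with a quick count of inner-loop iterations per point. This is a sketch, not code from the thread: a point well outside the set escapes almost immediately, so execution keeps leaving the hot loop, while a point near the centre of the region above keeps the inner loop running for many iterations.]

```python
# Sketch (not the thread's code): count inner-loop iterations for two sample
# points, to show why the region change keeps PyPy inside its compiled loop.

def iterations_before_escape(c, maxiter=1000):
    """Number of inner-loop iterations executed for point c."""
    z = 0j
    for i in range(maxiter):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag > 4.0:
            return i + 1
    return maxiter

# A point far outside the set: escapes on the very first iteration.
fast = iterations_before_escape(-2.5 + 0.0j)

# The centre of the region Antonio and Armin used: stays in the loop far longer.
slow = iterations_before_escape(0.37865401 + 0.669227668j)
```

When most points behave like the second case, execution stays in the traced loop and the slower bridge exits are rare, which fits the measured speedup over Cython on that region.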
Hi Antonio! Apologies for the slow reply; this got filed into a subfolder. The numbers are interesting, and I'm also interested in the C version. I'm hoping that my tutorial will be accepted for PyCon next March (the talks are announced in two weeks); assuming I get to talk again, I'll update the tutorial, and adding more for PyPy plus a C equivalent will be very useful.

Given that the C version should be very similar to the ShedSkin version, maybe it just comes down to compiler differences? On my MacBook (where I originally wrote the talk) I think the differences in speed came from two versions of gcc (Cython seemed to prefer one, ShedSkin the other; I ran out of time trying to unify that test). Do you definitely use the same optimisation flags? ShedSkin (from memory) requests fast math and a few other things in the generated Makefile. Ian.
Hi, On Tue, Nov 15, 2011 at 15:54, Ian Ozsvald <ian@ianozsvald.com> wrote:
ShedSkin (from memory) requests fast math and a few other things in the generated Makefile.
Ah, so it is cheating that way. Indeed, I didn't try to play with gcc options; I just used -O2 (or -O3, which made no difference). The C source code is completely obvious and free of surprises. You can see it here (it outputs the raw data to stdout, so you have to pipe it through a converter program to display the result): http://paste.pocoo.org/show/508215/ A bientôt, Armin.
Hi, 2011/11/15 Armin Rigo <arigo@tunes.org>:
Ah, it is cheating that way. Indeed, I didn't try to play with gcc options; I just used -O2 (or -O3, which made no difference).
FYI, here is the default FLAGS file for shedskin:

CC=g++
CCFLAGS=-O2 -march=native -Wno-deprecated $(CPPFLAGS)
LFLAGS=-lgc -lpcre $(LDFLAGS)

But of course you can change the compiler and play with its flags to improve performance. Best regards, -- Jérémie
From memory the 'native' flag made a difference (I think it allows use of SSE?). I guess that is something I'll normalise for a future v0.3 release of my handbook :-) Cheers, Ian.
2011/11/16 Ian Ozsvald <ian@ianozsvald.com>:
From memory the 'native' flag made a difference (I think it allows use of SSE?).
It depends on the machine, of course, but yes, on most machines it enables SSE. Just compare the output of

$ < /dev/null g++ -E -v - |& grep cc1

and

$ < /dev/null g++ -march=native -E -v - |& grep cc1

The following flags are added for me:

-march=core2 -mcx16 -msahf --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=4096 -mtune=core2

-mcx16 and -msahf enable some additional instructions (see http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html), -march=core2 enables Intel's 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3, and -mtune=core2 tunes the generated code for that CPU. Best regards, -- Jérémie
participants (5)
- Antonio Cuni
- Armin Rigo
- Ian Ozsvald
- Jérémie Roquet
- Maciej Fijalkowski