Very slow Julia example on PyPy/numpy - can you help me understand why it is slow?
Hello all. I'm talking at PyData London this Saturday on the 'high performance landscape', where pypy and pypy/numpy get discussed. PyPy with lists works great on my Julia example, but the pypy/numpy version is very poor with respect to the other solutions (cython, pythran, numba). Can you help me understand what's wrong? I'd like to give a sensible answer to the audience. I'll also note in the talk that this result is still 10x faster than Python+numpy used the same way.

Sidenote - on a separate example (for my book) using Monte Carlo Pi estimation with a 1e7 vector, pypy/numpy is 2x slower than python/numpy (which obviously isn't bad), so I'm fairly confident my pypy/numpy code is working correctly. That example uses a vectorised operation.

I'm using a fresh daily build of pypy and a fresh checkout of numpy. Here's the main function; I'll copy the full example down below:

def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
    for i in range(len(zs)):  # range, xrange - same speed
        n = 0
        z = zs[i]
        c = cs[i]
        # expanding the math makes it 2 seconds slower
        #while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
        # using a for loop in place of while makes it 1 second slower
        # using direct array references (removing z,c) does not change speed
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n

I pass in 2 np arrays as complex128 (zs has co-ords, cs is a constant - this is a consistent example in my book which looks at cpu and memory costs etc). I iterate with a while loop doing some simple math.

Cython+numpy takes 0.19s (similar results for numba, pythran).
PyPy+numpy takes 5s.
Python+numpy takes 54s (due to the dereference cost on each array access).

Is the problem related to dereferencing? I'm not doing a vectorised operation for this example. I know that dereferencing each element using Python+numpy is horribly slow.

Ian.

Full code:

"""Julia set generator without optional PIL-based image drawing"""
import time
import numpy as np

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -.42193


def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
    for i in range(len(zs)):  # range, xrange - same speed
        n = 0
        z = zs[i]
        c = cs[i]
        # expanding the math makes it 2 seconds slower
        #while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
        # using a for loop in place of while makes it 1 second slower
        # using direct array references (removing z,c) does not change speed
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n


def calc_pure_python(desired_width, max_iterations):
    """Create a list of complex co-ordinates (zs) and complex parameters (cs),
    build Julia set and display"""
    x_step = (float(x2 - x1) / float(desired_width))
    y_step = (float(y1 - y2) / float(desired_width))
    x = []
    y = []
    ycoord = y2
    while ycoord > y1:
        y.append(ycoord)
        ycoord += y_step
    xcoord = x1
    while xcoord < x2:
        x.append(xcoord)
        xcoord += x_step
    # build a list of co-ordinates and the initial condition for each cell.
    # Note that our initial condition is a constant and could easily be removed,
    # we use it to simulate a real-world scenario with several inputs to our function
    zs = []
    cs = []
    for ycoord in y:
        for xcoord in x:
            zs.append(complex(xcoord, ycoord))
            cs.append(complex(c_real, c_imag))
    zs_np = np.array(zs, np.complex128)
    cs_np = np.array(cs, np.complex128)
    print "Length of x:", len(x)
    print "Total elements:", len(zs)
    start_time = time.time()
    output = np.empty(len(zs), dtype=np.int64)
    calculate_z(max_iterations, zs_np, cs_np, output)
    end_time = time.time()
    secs = end_time - start_time
    print "Took", secs, "seconds"
    validation_sum = sum(output)
    print "Total sum of elements (for validation):", validation_sum


# Calculate the Julia set using a pure Python solution with
# reasonable defaults for a laptop
calc_pure_python(desired_width=1000, max_iterations=300)

--
Ian Ozsvald (A.I. researcher)
ian@IanOzsvald.com
http://IanOzsvald.com http://MorConsulting.com/ http://Annotate.IO http://SocialTiesApp.com/ http://TheScreencastingHandbook.com http://FivePoundApp.com/ http://twitter.com/IanOzsvald http://ShowMeDo.com
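The Monte Carlo Pi example Ian mentions is not included in the thread. As a rough sketch of that kind of vectorised workload - the function name, seed and sampling scheme below are illustrative assumptions, not his book code - it could look something like this on CPython+numpy:

import numpy as np

def estimate_pi(n=10**7, seed=12345):
    """Vectorised Monte Carlo estimate of pi using n random points."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(0.0, 1.0, n)
    y = rng.uniform(0.0, 1.0, n)
    # One whole-array comparison: points inside the quarter circle of radius 1.
    inside = (x * x + y * y) < 1.0
    return 4.0 * inside.sum() / n

print(estimate_pi())

The work here is a handful of operations over 1e7-element arrays, rather than 1e7 element-by-element operations, which is the shape of workload the thread contrasts with the Julia loop above.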
Hi Ian, On 20 February 2014 12:53, Ian Ozsvald <ian@ianozsvald.com> wrote:
def calculate_z(maxiter, zs, cs, output):
    """Calculate output list using Julia update rule"""
This particular example uses numpy in a very strange and useless way, as I'm sure you know. It builds a regular list of 1'000'000 items; then it converts it to a pair of numpy arrays; then it iterates over the array. It's obviously better to just iterate over the original list (also in CPython). But I know it's not really the point; the point is rather that numpy is slow with PyPy, slower than we would expect. This is known, basically, but a good reminder that we need to look at it from the performance point of view. So far, we focused on completeness.

[ Just for reference, I killed numpy from your example and got a 4x speed-up (down from 5s to 1.25s). Afterwards, I expanded the math:
# expanding the math makes it 2 seconds slower
#while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
which is good in theory because abs() requires a square root. As it turns out, it is very good indeed. This results in another 5x speed-up, to 0.25s. This is close enough to Cython speed (which is probably mostly gcc's speed in this example) that I'd say we are done. ]

A bientôt, Armin.
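To make Armin's two changes concrete - iterating over the original Python lists rather than numpy arrays, and expanding the escape test so the square root hidden in abs() disappears - a sketch of the resulting inner loop might look like this (calculate_z_list is an assumed name; the thread does not show Armin's exact edit):

def calculate_z_list(maxiter, zs, cs, output):
    """Julia update rule over plain Python lists, with the escape test expanded."""
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # abs(z) < 2 hides a square root; comparing the squared magnitude
        # against 4 is mathematically equivalent and cheaper.
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4.0:
            z = z * z + c
            n += 1
        output[i] = n

Here zs, cs and output would be the plain lists built in calc_pure_python before the np.array() conversion, with output a list of zeros of the same length.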
Hi Armin. The point of the question was not to remove numpy but to understand the behaviour :-) I've already done a set of benchmarks with lists and with numpy, I've copied the results below. I'm using the same Julia code throughout (there's a note about the code below). PyPy on lists is indeed very compelling.

One observation I've made of beginners (and I did the same) is that iterating over numpy arrays seems natural until you learn it is horribly slow. Then you learn to vectorise. Some of the current tools handle the non-vectorised case really well and that's something I want to mention.

For Julia I've used lists and numpy. Using a numpy array rather than an `array.array` makes sense as the latter won't hold a complex type (and messing with decomposing the complex elements into two arrays gets even sillier) and the example is still trivial for a reader to understand. numpy arrays (and Python arrays) are good because they use much less RAM than big lists. The reason why my example code above made lists and then turned them into numpy arrays... that's because I was lazy and hadn't finished tidying this demo (my bad!).

My results on the Julia code (4 core i7) for reference:

lists:
  CPython   12s
  Pythran   0.4s   # 1 line of annotation
  PyPy      0.3s   # no changes required, my recommendation in my talk
  Cython    0.2s   # lots of annotation required
  ShedSkin  0.22s  # basically the same result as Cython plus the overhead of a 1e6 element memory copy

numpy (OMP = OpenMP with dynamic scheduling):
  PyPy         5s
  Numba        0.4s  (I couldn't find prange support in v0.12)
  Cython       0.16s
  Pythran OMP  0.1s
  Cython OMP   0.05s

I don't mind that my use of numpy is silly, I'm just curious to understand why pypynumpy diverges from the results of the other compiler technologies. The simple answer might be 'because pypynumpy is young' and that'd be fine - at least I'd have an answer if someone asks the question in my talk. If someone has more details, that'd be really interesting too. Is there a fundamental reason why pypynumpy couldn't do the example as fast as cython/numba/pythran?

Cheers, i.
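For contrast with the element-by-element loop under discussion, a whole-array ("vectorised") version of the same update - the style Ian alludes to above - might look roughly like this on CPython+numpy. The function name and the masking scheme are illustrative assumptions, not code from the book or the thread:

import numpy as np

def calculate_z_vectorised(maxiter, zs, cs):
    """Whole-array Julia iteration: no Python-level inner loop over elements."""
    z = zs.copy()
    output = np.zeros(len(zs), dtype=np.int64)
    todo = np.abs(z) < 2.0              # points still iterating
    for _ in range(maxiter):
        if not todo.any():
            break
        z[todo] = z[todo] * z[todo] + cs[todo]
        output[todo] += 1               # count this update, like n += 1 in the scalar loop
        todo &= np.abs(z) < 2.0         # escaped points stop iterating
    return output

Each step performs a handful of large array operations instead of a million tiny ones, which is the kind of usage that numpy (and, per Ian's Monte Carlo observation, pypy/numpy) already handles reasonably well.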
Hi Ian, On 20 February 2014 21:40, Ian Ozsvald <ian@ianozsvald.com> wrote:
compiler technologies. The simple answer might be 'because pypynumpy is young' and that'd be fine - at least I'd have an answer if someone asks the question in my talk. If someone has more details, that'd be really interesting too. Is there a fundamental reason why pypynumpy couldn't do the example as fast as cython/numba/pythran?
Indeed, it's because numpypy is young and not optimized so far. Note also that while big improvements are certainly possible and planned for this simple example, making consistent big improvements for more cases requires some serious involvement. Unlike other projects like Numba, PyPy is a more diverse project, and right now only one of the few core people is working part-time on numpypy. As a general rule, I'd say that we're missing someone really interested in numpy. A bientôt, Armin.
Hello Ian, On 20/02/14 20:40, Ian Ozsvald wrote:
Hi Armin. The point of the question was not to remove numpy but to understand the behaviour :-) I've already done a set of benchmarks with lists and with numpy, I've copied the results below. I'm using the same Julia code throughout (there's a note about the code below). PyPy on lists is indeed very compelling.
One observation I've made of beginners (and I did the same) is that iterating over numpy arrays seems natural until you learn it is horribly slow. Then you learn to vectorise. Some of the current tools handle the non-vectorised case really well and that's something I want to mention.
For Julia I've used lists and numpy. Using a numpy array rather than an `array.array` makes sense as the latter won't hold a complex type (and messing with decomposing the complex elements into two arrays gets even sillier) and the example is still trivial for a reader to understand. numpy arrays (and Python arrays) are good because they use much less RAM than big lists. The reason why my example code above made lists and then turned them into numpy arrays... that's because I was lazy and hadn't finished tidying this demo (my bad!).
I agree that your code looks rather sensible (at least, to people who haven't internalised yet all the "stupid" implementation details concerning arrays, lists, iteration and vectorisation). So it's a bit of a shame that PyPy doesn't do better.
I don't mind that my use of numpy is silly, I'm just curious to understand why pypynumpy diverges from the results of the other compiler technologies. The simple answer might be 'because pypynumpy is young' and that'd be fine - at least I'd have an answer if someone asks the question in my talk. If someone has more details, that'd be really interesting too. Is there a fundamental reason why pypynumpy couldn't do the example as fast as cython/numba/pythran?
To answer such questions, the best way is to use the jitviewer (https://bitbucket.org/pypy/jitviewer). Looking at the trace for the inner loop, I can see every operation on a scalar triggers a dict lookup to obtain its dtype. This looks like self-inflicted pain coming from the broken objspace abstraction rather than anything fundamental. Fixing that should improve speed by about an order of magnitude. Cheers, Ronan
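One way to observe the per-element overhead Ronan describes, without reading JIT traces, is a small micro-benchmark comparing element-by-element access on a plain list and on a numpy array. This is only an illustrative sketch (array size and timing parameters are arbitrary), not something from the thread:

import timeit

setup = """
import numpy as np
zs_list = [complex(i, i) for i in range(100000)]
zs_np = np.array(zs_list, dtype=np.complex128)
"""

def best_of_three(stmt):
    # Smallest of three timings, ten passes over the data per timing.
    return min(timeit.repeat(stmt, setup=setup, number=10, repeat=3))

# Iterating a numpy array yields a fresh scalar object per element; if every
# scalar operation then has to re-derive its dtype (the lookup Ronan saw in
# the trace), the second loop pays that cost on each of its one million
# abs() calls per timing (100,000 elements x 10 passes).
print(best_of_three("for z in zs_list: abs(z)"))
print(best_of_three("for z in zs_np: abs(z)"))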
Hi Ronan. You can't blame objspace for everything ;-) It looks like it's easily fixable. I'm in transit right now but I can fix it once I'm home. Ian - please come with more broken examples, they usually come from stupid reasons! Cheers, fijal
FYI it's fixed.
Much obliged everyone. I've noted all these points in my post about my high performance talk with updates linking back to these threads: http://ianozsvald.com/2014/02/23/high-performance-python-at-pydatalondon-201... Re. Ronan - I'll use jitviewer, I just haven't had time. I hope to cover it in the JIT section of my book (in a month or so). On 25 February 2014 11:36, Maciej Fijalkowski <fijall@gmail.com> wrote:
FYI it's fixed
participants (4)
- Armin Rigo
- Ian Ozsvald
- Maciej Fijalkowski
- Ronan Lamy