[Numpy-discussion] Cython-based OpenMP-accelerated quartic polynomial solver
Juha Jeronen
juha.jeronen at jyu.fi
Wed Sep 30 19:53:52 EDT 2015
On 30.09.2015 19:20, Chris Barker wrote:
> On Wed, Sep 30, 2015 at 6:57 AM, Nathan Goldbaum
> <nathan12343 at gmail.com <mailto:nathan12343 at gmail.com>> wrote:
>
> Note however that with the current version of the code, not
> having OpenMP will severely limit the performance, especially
> on quad-core machines, since an important factor in the speed
> comes from the parallel processing of the independent problem
> instances.
>
>
> It seems you got much more than 4 times performance improvement
True, good point. That's the power of a short explicit algorithm for a
specific problem :)
(And maybe some Cython, too.)
> -- which is the most you'll get from four cores, presumably :-). not
> that another 4 times or so isn't a good thing.
True in general :)
To be precise, in this case, HyperThreading seems to offer some extra
performance.
And there seems to be something wonky with the power management on my
laptop. The number I posted (1.2e7 quartics per second) seemed slightly
low compared to earlier informal benchmarking I had done, and indeed,
benchmarking with 8 threads again I'm now getting 1.6e7 quartics per
second (i7-4710MQ at 2.5 GHz, RAM at 1333 MHz, running on Linux Mint 17.2).
Here are some more complete benchmark results. Because the polynomials
to be solved are randomly generated during each run of the program,
unless otherwise noted, I've rerun the benchmark three times for each
number of threads and picked the lowest-performance result, rounded down.
Number of threads vs. quartics solved per second:
1: 3.0e6
2: 6.1e6
4: 1.1e7
8: 1.6e7 (note: higher than previously!)
Threads vs. cubics:
1: 8.4e6
2: 1.6e7
4: 3.0e7
8: 4.6e7
Threads vs. quadratics:
1: 2.5e7
2: 4.8e7
4: 8.5e7 (typical across 9 benchmarks; lowest 5.7e7)
8: 9.5e7
From these, the scaling in general seems pretty good, and for quartics
and cubics, HyperThreading offers some extra performance, namely about
50% over the 4-thread result in both cases. The total speedup for
quartics is 5.3x (3.6x without HT), and for cubics, 5.4x (3.5x without HT).
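The speedup ratios follow directly from the throughput tables above. A
quick Python check (numbers copied from the tables; the reported figures
were rounded down, so the ratios can differ in the last digit):

```python
# Throughputs reported above (problems solved per second).
quartics = {1: 3.0e6, 2: 6.1e6, 4: 1.1e7, 8: 1.6e7}
cubics   = {1: 8.4e6, 2: 1.6e7, 4: 3.0e7, 8: 4.6e7}

def speedup(table, n_threads):
    """Speedup relative to the single-thread result."""
    return table[n_threads] / table[1]

print("quartics, 8 threads: %.1fx" % speedup(quartics, 8))  # ~5.3x
print("quartics, 4 threads: %.1fx" % speedup(quartics, 4))
print("cubics,   8 threads: %.1fx" % speedup(cubics, 8))
```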
"About 4x, as expected" is still a good characterization, though :)
For quadratics, 4 threads seems to be the useful maximum; the
improvement with HT is only 11%, for a total speedup of 3.8x (vs. 3.4x
without HT). Quadratics require far fewer arithmetic operations than the
higher-order cases, so a possible explanation is that this case is
memory-bound.
Running the numbers on a simplistic estimate, the total of input and
output is far from hitting the DDR3 bandwidth limit, but of course
that's ignoring (at least) latencies and the pattern of data accesses.
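For what it's worth, the simplistic estimate goes roughly like this
(assuming three real float64 coefficients in and two complex128 roots
out per quadratic; the solver's actual memory layout may differ):

```python
# Back-of-the-envelope memory traffic for the quadratic case.
# Assumed (not taken from the solver): 3 float64 coefficients in,
# 2 complex128 roots out, per problem instance.
bytes_in = 3 * 8                          # a, b, c as float64
bytes_out = 2 * 16                        # two complex128 roots
bytes_per_problem = bytes_in + bytes_out  # 56 bytes

throughput = 9.5e7                        # quadratics/s at 8 threads (above)
traffic = throughput * bytes_per_problem  # bytes/s

print("~%.1f GB/s" % (traffic / 1e9))  # ~5.3 GB/s
# Dual-channel DDR3-1333 peaks at roughly 21 GB/s, so the raw
# input/output traffic alone is far from saturating the bus.
```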
Testing the completely silly "linear" case that is there for documenting
code structure only, I get:
1: 1.9e8
2: 3.1e8
4: 3.6e8
8: 2.9e8
which should give a reasonable first approximation for the maximum
practical throughput on this machine, when performing almost no arithmetic.
Anyway, returning from the sidetrack...
> But I suppose there may be another technical option to support
> multiple cores
>
>
> python threads with nogil?
...maybe?
I hadn't thought about this option. I suppose that in pseudocode, the
code structure would need to be something like this?
---8<---8<---8<---
from threading import Thread

cdef void fast_solver(...) nogil:
    # ...do stuff without touching Python objects...

def my_thread_entry(j_start, j_end):
    cdef unsigned int j
    with nogil:  # release the GIL so the threads can actually run in parallel
        for j in range(j_start, j_end):
            fast_solver(...)  # process local slice

t1 = Thread(target=my_thread_entry, args=(0, j_split))
t2 = Thread(target=my_thread_entry, args=(j_split, j_end_global))
t1.start()
t2.start()
t1.join()
t2.join()
---8<---8<---8<---
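In pure Python the same pattern (minus the GIL release, so no real
parallelism, but it demonstrates the range-splitting and join logic)
would look something like this; `split_range` and the squaring worker
are just illustrative stand-ins:

```python
from threading import Thread

def split_range(n, n_threads):
    """Split range(0, n) into n_threads contiguous (start, end) slices."""
    base, extra = divmod(n, n_threads)
    bounds, start = [], 0
    for i in range(n_threads):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

results = [None] * 10

def worker(j_start, j_end):
    # stand-in for the nogil fast_solver loop over a local slice
    for j in range(j_start, j_end):
        results[j] = j * j

threads = [Thread(target=worker, args=b) for b in split_range(10, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # squares of 0..9, each slice filled by one thread
```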
-J