I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.  Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79.  Since the computation is trivial data parallelism, and I believe it's still pretty far off the memory bandwidth limit, I would expect a speedup of 1.95 or higher.<div>

<br></div><div>One reason I suggest TBB is that it can produce a good schedule while still adapting to load from other processes and threads.  Numexpr's current scheme already adapts well, but simply dividing the data into one piece per thread does not: one thread can spend a fair bit of time finishing up while the others sit idle at the end.  Cilk might be a better option than TBB, since the code could remain in C.</div>
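<div>To make the idle-at-the-end concern concrete, here is a minimal simulation sketch. It is not numexpr, TBB, or Cilk code; the thread speeds, chunk size, and function names are hypothetical, and it models plain "grab the next chunk when free" scheduling rather than any real work-stealing scheduler.</div>

```python
import heapq

def static_makespan(n_items, speeds):
    """One equal piece per thread; the slowest thread sets the finish time."""
    piece = n_items / len(speeds)
    return max(piece / s for s in speeds)

def dynamic_makespan(n_items, speeds, chunk):
    """Each thread grabs the next fixed-size chunk as soon as it is free."""
    # Heap of (time at which the thread becomes free, thread id).
    free_at = [(0.0, t) for t in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = n_items
    finish = 0.0
    while remaining > 0:
        now, t = heapq.heappop(free_at)
        size = min(chunk, remaining)
        remaining -= size
        done = now + size / speeds[t]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, t))
    return finish

# Hypothetical load: thread 1 runs at half speed because another
# process is competing for its core.
speeds = [1.0, 0.5]        # items processed per unit of time
n_items = 1_000_000

print("static :", static_makespan(n_items, speeds))
print("dynamic:", dynamic_makespan(n_items, speeds, chunk=4096))
```

<div>In this model the static split waits for the slowed thread to finish its whole half, while the chunked version ends within one chunk of the ideal time, at the cost of more scheduling steps.</div>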

<div><br></div><div>-Mark<br><br><div class="gmail_quote">On Mon, Jan 10, 2011 at 3:55 AM, Francesc Alted <span dir="ltr"><<a href="mailto:faltet@pytables.org">faltet@pytables.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

On Monday 10 January 2011 11:05:27 Francesc Alted wrote:<br>

<div class="im">> Also, I'd like to try out the new thread scheduling that you<br>

> suggested to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).<br>

<br>

</div>I've just implemented the new partition scheme in numexpr<br>

(T0T0...T1T1..., the original being T0T1T0T1...).  I'm attaching the<br>

patch for this.  The results are a bit confusing.  For example, using<br>

the attached benchmark (poly.py), I get these results on a common<br>

dual-core, non-NUMA machine:<br>

<br>

With the T0T1...T0T1... (original) scheme:<br>

<br>

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points<br>

Using numpy:<br>

*** Time elapsed: 3.497<br>

Using numexpr:<br>

*** Time elapsed for 1 threads: 1.279000<br>

*** Time elapsed for 2 threads: 0.688000<br>

<br>

With the T0T0...T1T1... (new) scheme:<br>

<br>

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points<br>

Using numpy:<br>

*** Time elapsed: 3.454<br>

Using numexpr:<br>

*** Time elapsed for 1 threads: 1.268000<br>

*** Time elapsed for 2 threads: 0.754000<br>

<br>

which is around 10% slower (with 2 threads) than the original partition.<br>

<br>
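For reference, the two block-to-thread layouts being compared can be sketched<br>
as follows.  This is only an illustration of the naming (T0T1T0T1... vs<br>
T0T0...T1T1...), assuming a fixed static assignment; the function names are<br>
hypothetical and the actual patch may hand out blocks differently.<br>

```python
def interleaved_owner(block, n_threads):
    """T0T1T0T1...: consecutive blocks rotate round-robin among threads."""
    return block % n_threads

def contiguous_owner(block, n_blocks, n_threads):
    """T0T0...T1T1...: each thread owns one contiguous run of blocks."""
    per_thread = -(-n_blocks // n_threads)  # ceiling division
    return block // per_thread

n_blocks, n_threads = 8, 2
print("".join(f"T{interleaved_owner(b, n_threads)}" for b in range(n_blocks)))
print("".join(f"T{contiguous_owner(b, n_blocks, n_threads)}" for b in range(n_blocks)))
```

<br>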

The results are a bit different on a NUMA machine (8 physical cores, 16<br>

logical cores via hyper-threading):<br>

<br>

With the T0T1...T0T1... (original) partition:<br>

<br>

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points<br>

Using numpy:<br>

*** Time elapsed: 3.005<br>

Using numexpr:<br>

*** Time elapsed for 1 threads: 1.109000<br>

*** Time elapsed for 2 threads: 0.677000<br>

*** Time elapsed for 3 threads: 0.496000<br>

*** Time elapsed for 4 threads: 0.394000<br>

*** Time elapsed for 5 threads: 0.324000<br>

*** Time elapsed for 6 threads: 0.287000<br>

*** Time elapsed for 7 threads: 0.247000<br>

*** Time elapsed for 8 threads: 0.234000<br>

*** Time elapsed for 9 threads: 0.242000<br>

*** Time elapsed for 10 threads: 0.239000<br>

*** Time elapsed for 11 threads: 0.241000<br>

*** Time elapsed for 12 threads: 0.235000<br>

*** Time elapsed for 13 threads: 0.226000<br>

*** Time elapsed for 14 threads: 0.214000<br>

*** Time elapsed for 15 threads: 0.235000<br>

*** Time elapsed for 16 threads: 0.218000<br>

<br>

With the T0T0...T1T1... (new) partition:<br>

<br>

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points<br>

Using numpy:<br>

*** Time elapsed: 3.003<br>

Using numexpr:<br>

*** Time elapsed for 1 threads: 1.106000<br>

*** Time elapsed for 2 threads: 0.617000<br>

*** Time elapsed for 3 threads: 0.442000<br>

*** Time elapsed for 4 threads: 0.345000<br>

*** Time elapsed for 5 threads: 0.296000<br>

*** Time elapsed for 6 threads: 0.257000<br>

*** Time elapsed for 7 threads: 0.237000<br>

*** Time elapsed for 8 threads: 0.260000<br>

*** Time elapsed for 9 threads: 0.245000<br>

*** Time elapsed for 10 threads: 0.261000<br>

*** Time elapsed for 11 threads: 0.238000<br>

*** Time elapsed for 12 threads: 0.210000<br>

*** Time elapsed for 13 threads: 0.218000<br>

*** Time elapsed for 14 threads: 0.200000<br>

*** Time elapsed for 15 threads: 0.235000<br>

*** Time elapsed for 16 threads: 0.198000<br>

<br>

In this case, the performance is similar, with perhaps a slight<br>

advantage for the new partition scheme, but I don't know whether it is<br>

worth making it the default (probably not, as this partition performs clearly<br>

worse on non-NUMA machines).  At any rate, both partitions perform very<br>

close to the aggregate memory bandwidth of this NUMA machine (around 10<br>

GB/s in the above case).<br>

<br>
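As a rough cross-check of the bandwidth figure, the timings above can be<br>
turned into speedups and an effective bandwidth.  This sketch assumes float64<br>
data and counts one read of x plus one write of the result per point; both<br>
are assumptions, since poly.py is not reproduced here, and counting any extra<br>
traffic (for example write-allocate reads) would raise the estimate.<br>

```python
N_POINTS = 100_000_000
BYTES_PER_POINT = 8   # assumption: float64 input and output

# Assumption: each point is read once (x) and written once (result).
traffic_gb = 2 * N_POINTS * BYTES_PER_POINT / 1e9

# Seconds, copied from the NUMA runs above: threads -> (original, new)
timings = {
    1: (1.109, 1.106),
    2: (0.677, 0.617),
    4: (0.394, 0.345),
    8: (0.234, 0.260),
    16: (0.218, 0.198),
}

def speedup(t_one_thread, t_n_threads):
    """Speedup relative to the single-thread run."""
    return t_one_thread / t_n_threads

def bandwidth_gb_s(seconds):
    """Effective bandwidth under the traffic assumption above."""
    return traffic_gb / seconds

base_orig, base_new = timings[1]
for n, (orig, new) in timings.items():
    print(f"{n:2d} threads: "
          f"original x{speedup(base_orig, orig):.2f} ({bandwidth_gb_s(orig):.1f} GB/s), "
          f"new x{speedup(base_new, new):.2f} ({bandwidth_gb_s(new):.1f} GB/s)")
```

Under these assumptions the 16-thread runs come out at roughly 7 to 8 GB/s,<br>
which is in the same range as the figure quoted above.<br>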

In general, I don't think there is much point in using Intel's TBB in<br>

numexpr because the existing implementation already hits memory<br>

bandwidth limits pretty early (around 10 threads in the latter example).<br>

<br>

--<br>

<font color="#888888">Francesc Alted<br>

</font><br>

<br></blockquote></div><br></div>