[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Charles R Harris charlesr.harris at gmail.com
Sat Mar 22 21:48:47 EDT 2008


On Sat, Mar 22, 2008 at 7:35 PM, Scott Ransom <sransom at nrao.edu> wrote:

> Here are results under 64-bit linux using gcc-4.3 (which by
> default turns on the various sse flags).  Note that -O3 is
> significantly better than -O2 for the "simple" calls:
>
> nimrod:~$ cat /proc/cpuinfo | grep "model name" | head -1
> model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
>
> nimrod:~$ gcc-4.3 --version
> gcc-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
>
> nimrod:~$ gcc-4.3 -O2 vec_bench.c -o vec_bench
> nimrod:~$ ./vec_bench
> Testing methods...
> All OK
> Problem size     Simple              Intrin              Inline
>      100   0.0001ms (100.0%)   0.0001ms ( 70.8%)   0.0001ms ( 74.3%)
>    1000   0.0008ms (100.0%)   0.0006ms ( 70.3%)   0.0007ms ( 80.3%)
>   10000   0.0085ms (100.0%)   0.0061ms ( 72.0%)   0.0067ms ( 78.8%)
>  100000   0.0882ms (100.0%)   0.0627ms ( 71.1%)   0.0677ms ( 76.7%)
>  1000000   3.6748ms (100.0%)   3.3312ms ( 90.7%)   3.3139ms ( 90.2%)
> 10000000  37.1154ms (100.0%)  35.9762ms ( 96.9%)  36.1126ms ( 97.3%)
>
> nimrod:~$ gcc-4.3 -O3 vec_bench.c -o vec_bench
> nimrod:~$ ./vec_bench
> Testing methods...
> All OK
> Problem size     Simple              Intrin              Inline
>      100   0.0001ms (100.0%)   0.0001ms (111.1%)   0.0001ms (116.7%)
>    1000   0.0005ms (100.0%)   0.0006ms (111.3%)   0.0007ms (126.8%)
>   10000   0.0056ms (100.0%)   0.0061ms (108.6%)   0.0067ms (118.9%)
>  100000   0.0581ms (100.0%)   0.0626ms (107.8%)   0.0677ms (116.5%)
>  1000000   3.4549ms (100.0%)   3.3339ms ( 96.5%)   3.3255ms ( 96.3%)
> 10000000  34.8186ms (100.0%)  35.9767ms (103.3%)  36.1099ms (103.7%)
>
>
> nimrod:~$ ./vec_bench_dbl
> Testing methods...
> All OK
> Problem size              Simple              Intrin
>         100   0.0001ms (100.0%)   0.0001ms (132.5%)
>        1000   0.0009ms (100.0%)   0.0012ms (134.5%)
>       10000   0.0119ms (100.0%)   0.0124ms (104.1%)
>      100000   0.1226ms (100.0%)   0.1276ms (104.1%)
>     1000000   7.0047ms (100.0%)   6.6654ms ( 95.2%)
>    10000000  70.0060ms (100.0%)  71.9692ms (102.8%)
>
> nimrod:~$ gcc-4.3 -O3 vec_bench_dbl.c -o vec_bench_dbl
> nimrod:~$ ./vec_bench_dbl
> Testing methods...
> All OK
> Problem size              Simple              Intrin
>         100   0.0001ms (100.0%)   0.0002ms (289.8%)
>        1000   0.0007ms (100.0%)   0.0012ms (172.7%)
>       10000   0.0114ms (100.0%)   0.0124ms (109.4%)
>      100000   0.1159ms (100.0%)   0.1278ms (110.3%)
>     1000000   6.9252ms (100.0%)   6.6585ms ( 96.1%)
>    10000000  69.1913ms (100.0%)  71.9664ms (104.0%)


It looks to me like the best approach here is to generate operator-specific
loops for arithmetic, then check the step size inside the loop for contiguous
data and, if found, branch to a block where the pointers have been cast to
the right type. The loop itself could even check the operator type by
switching on the function address, so that the code modifications stay
localized. The compiler can do the rest.

Chuck