[Numpy-discussion] numpy with ACML

Wed Feb 1 15:04:01 EST 2006

> There is code for that on netlib:
> http://www.netlib.org/blas/blast-forum/cblas.tgz
>
> I used it myself for my C code before and it worked just fine.
>
> Piotr

Piotr,
Thanks. I got numpy to work using cblas and ACML. Details are at the
bottom of this email.
I then ran the bench.py tests on numpy [1 processor Opteron ?1.8 GHz]
and got slightly unexpected results:

numpy timings are given both linked to cblas+acml (rows marked
--acml+cblas) and not linked. Neither numarray nor Numeric was linked
to any BLAS:
python bench.py
Tests    x.T*y   x*y.T     A*x     A*B   A.T*x    half    2in2
Dimension: 5
Array   0.5700  0.1600  0.1200  0.1600  0.6200  0.4300  0.4800 --acml+cblas
Matrix  3.1000  0.9300  0.4000  0.4600  0.6500  1.7000  2.6200 --acml+cblas
Array   0.6400  0.1700  0.1500  0.1800  0.6100  0.3600  0.4000
Matrix  3.2300  0.6900  0.4100  0.4600  0.6700  1.4900  2.3400
NumArr  1.2100  2.8500  0.2700  2.8600  5.0000  4.1100  6.8300
Numeri  0.7300  0.1800  0.1600  0.2000  0.4100  0.3300  0.4300

Dimension: 50
Array   5.9200  0.8400  0.2900  6.9300  8.0900  2.3600  2.4500 --acml+cblas
Matrix 30.5500  1.8500  0.6000  7.4500  0.9300  3.7100  4.6400 --acml+cblas
Array   6.5900  2.7100  0.7500 25.3100  8.5000  0.5600  0.6100
Matrix 32.5200  3.2600  1.0200 25.6100  1.2900  1.7400  2.5900
NumArr 12.6600  3.9700  0.7400 27.7900  6.4900  4.5500  7.1900
Numeri  7.9700  1.5000  0.6500 24.2700  7.4200  0.6000  2.3200

Dimension: 500
Array   0.9800  3.2900  0.6100 65.0000 10.8600  2.3100  2.5500 --acml+cblas
Matrix  3.5300  3.3500  0.6400 64.9300  0.6500  2.3300  2.6100 --acml+cblas
Array   1.0900  4.5600  0.8300 589.0000 11.0700  0.1300  0.2600
Matrix  3.7000  4.5800  0.8400 593.7300  1.1700  0.1300  0.3200
NumArr  1.6700  3.3100  0.7700 417.5600  4.3900  0.8500  1.1000
Numeri  1.1900  3.5200  0.7800 559.8100  9.7400  0.8000  2.4100

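-- acml+cblas does speed up matrix multiplication (A*B): by about a
factor of 9 at dimension 500 (65.0 vs 589.0) and about 3.7 at
dimension 50 (6.93 vs 25.31).

but
  -- it doesn't really help vector dot products (x.T*y: 0.98 vs 1.09
at dimension 500).
  -- it slows down the searching operations half and 2in2 by a factor
of 10 or more at dimension 500, and by about 4 at dimension 50.

Matrices are generally slower than arrays, much slower at dimension 5,
except for A.T*x, which is ~10x faster for matrices at dimensions 50
and 500.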
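
For anyone who wants to recheck the ratios, here is a minimal Python sketch that recomputes them from the dimension-500 Array rows of the table. The variable and function names are mine; the numbers are copied straight from the table.

```python
# Dimension-500 "Array" timings copied from the benchmark table.
tests = ["x.T*y", "x*y.T", "A*x", "A*B", "A.T*x", "half", "2in2"]
linked = [0.9800, 3.2900, 0.6100, 65.0000, 10.8600, 2.3100, 2.5500]     # --acml+cblas
unlinked = [1.0900, 4.5600, 0.8300, 589.0000, 11.0700, 0.1300, 0.2600]  # no BLAS


def speedups(linked_times, unlinked_times):
    """Return unlinked/linked ratios; a value above 1 means the linked build is faster."""
    return [u / l for l, u in zip(linked_times, unlinked_times)]


ratios = dict(zip(tests, speedups(linked, unlinked)))
for name in tests:
    print(f"{name:6s} {ratios[name]:6.2f}")
```

A ratio above 1 means linking helped and a ratio below 1 means it hurt. A*B comes out near 9, x.T*y just above 1, and half and 2in2 well below 1.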

I also tried the Goto BLAS library linked in with cblas. The results
were similar, except for a slightly faster x.T*y, but it was trickier
to get linked.

--George Nurser

------------------------------------------------------------------------

Making the cblas.a library was straightforward. I just changed the
flags in Makefile.LINUX to:

CFLAGS = -O3 -DADD_ -pthread -fno-strict-aliasing -m64 -msse2 -mfpmath=sse -march=opteron -fPIC
FFLAGS = -Wall -fno-second-underscore -fPIC -O3 -funroll-loops -march=opteron -mmmx -msse2 -msse -m3dnow
RANLIB = ranlib
BLLIB = where libacml.so lives/libacml.so

Then link Makefile.LINUX to Makefile.in and run make.

The resulting cblas.a must then be moved or linked to libcblas.a in  
the *same* directory as the libacml.so.

This directory then needs to be added to the $LD_LIBRARY_PATH if it  
is not a standard one.
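
A rough way to check the two conditions above before building, as a Python sketch. The helper name check_blas_layout is hypothetical and not part of numpy; it only looks for the two file names mentioned here and for the directory on $LD_LIBRARY_PATH.

```python
import os


def check_blas_layout(lib_dir, env=None):
    """Report whether lib_dir holds both libraries and is on LD_LIBRARY_PATH.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # Split the search path and drop empty entries.
    search_path = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    lib_dir = os.path.abspath(lib_dir)
    return {
        # os.path.exists follows symlinks, so a link to cblas.a counts too.
        "libcblas.a": os.path.exists(os.path.join(lib_dir, "libcblas.a")),
        "libacml.so": os.path.exists(os.path.join(lib_dir, "libacml.so")),
        "on_ld_library_path": lib_dir in [os.path.abspath(p) for p in search_path],
    }
```

If the directory is one the loader already searches by default, a False for on_ld_library_path is not a problem.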

I needed a site.cfg in numpy/numpy/distutils/site.cfg as follows:
[blas]
blas_libs = cblas, acml
library_dirs = where libacml.so lives
include_dirs =  where cblas.h lives

[lapack]
language = f77
lapack_libs = acml
library_dirs = where libacml.so lives
include_dirs = where acml *.h live
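
As a sanity check on the file above, the following Python sketch parses a site.cfg of this shape with the standard configparser module and confirms that both sections point at the same library directory. This is not how numpy's distutils reads the file; it is only a quick check. The /opt/... paths are hypothetical stand-ins for the "where ... lives" placeholders.

```python
import configparser

# Hypothetical paths standing in for the "where ... lives" placeholders.
SITE_CFG = """\
[blas]
blas_libs = cblas, acml
library_dirs = /opt/acml/lib
include_dirs = /opt/cblas/include

[lapack]
language = f77
lapack_libs = acml
library_dirs = /opt/acml/lib
include_dirs = /opt/acml/include
"""


def read_libs(cfg_text):
    """Return {section: (list of libraries, library_dirs)} for each section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    out = {}
    for section in parser.sections():
        libs_key = section + "_libs"  # blas_libs or lapack_libs
        libs = [s.strip() for s in parser.get(section, libs_key).split(",")]
        out[section] = (libs, parser.get(section, "library_dirs"))
    return out


cfg = read_libs(SITE_CFG)
# libcblas.a sits next to libacml.so, so both sections should share a directory.
same_dir = cfg["blas"][1] == cfg["lapack"][1]
print(cfg, same_dir)
```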

Then numpy and scipy both seem to build fine. numpy passes  
t=numpy.test(), scipy passes scipy.test(level=10).