Calling scipy blas from cython is extremely slow
Hi,

following the excellent advice of V. Armando Sole, I have finally succeeded in calling the blas routines shipped with scipy from cython.

I am doing this to avoid shipping an extra blas library for a project of mine that uses scipy but has some parts coded in cython for extra speed.

So far I have managed to get things working on Linux. Here is what I do:

The following code snippet gives me the dgemv pointer (which is a pointer to a Fortran function, even though it comes from scipy.linalg.blas.cblas, which is weird).

    from cpython cimport PyCObject_AsVoidPtr
    import scipy as sp
    __import__('scipy.linalg.blas')

    ctypedef void (*dgemv_ptr) (char *trans, int *m, int *n,\
                                double *alpha, double *a, int *lda, double *x,\
                                int *incx,\
                                double *beta, double *y, int *incy)

    cdef dgemv_ptr dgemv=<dgemv_ptr>PyCObject_AsVoidPtr(\
        sp.linalg.blas.cblas.dgemv._cpointer)

Then I can call dgemv in a tight loop, defining the constants first and calling dgemv inside the loop:

    cdef int one=1
    cdef double onedot = 1.0
    cdef double zerodot = 0.0
    cdef char trans = 'N'
    for i in xrange(N):
        dgemv(&trans, &nq, &order,\
              &onedot, <double *>np.PyArray_DATA(C), &order, \
              <double*>np.PyArray_DATA(c_x0), &one, \
              &zerodot, <double*>np.PyArray_DATA(y0), &one)

It works, but it is many times slower than linking to the cblas that is available on the same system.

Specifically, I have about 8 calls to blas in my tight loop: 4 of them are to dgemv and the others are to dcopy. Changing a single dgemv call from the system cblas to the blas function returned by scipy.linalg.blas.cblas.dgemv._cpointer makes the execution time of a test case jump from about 0.7 s to 1.25 s on my system.

Any clue about why this is happening?

After all, on Linux, scipy dynamically links to atlas exactly as I link to atlas when I use the cblas functions.
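For readers unfamiliar with the Fortran calling convention behind that function pointer, here is a minimal pure-Python model of what DGEMV computes. It is only a sketch: `dgemv_reference` is a hypothetical helper written for illustration, it handles positive increments only, and it makes no attempt to match the speed or the edge-case handling of a real BLAS. The point it shows is that the matrix is read as a flat column-major buffer, with element (i, j) at offset `i + j * lda`.

```python
def dgemv_reference(trans, m, n, alpha, a, lda, x, incx, beta, y, incy):
    """Simplified model of Fortran DGEMV (positive increments only).

    `a` is a flat column-major buffer: element (i, j) lives at a[i + j * lda].
    `y` is updated in place, as the Fortran routine does.
    """
    if trans == "N":
        # y (length m) = alpha * A * x (length n) + beta * y
        for i in range(m):
            acc = 0.0
            for j in range(n):
                acc += a[i + j * lda] * x[j * incx]
            y[i * incy] = alpha * acc + beta * y[i * incy]
    else:
        # y (length n) = alpha * A^T * x (length m) + beta * y
        for j in range(n):
            acc = 0.0
            for i in range(m):
                acc += a[i + j * lda] * x[i * incx]
            y[j * incy] = alpha * acc + beta * y[j * incy]
    return y


# A = [[1, 2, 3],
#      [4, 5, 6]]  stored column by column, so lda = m = 2.
a_colmajor = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]

y_n = dgemv_reference("N", 2, 3, 1.0, a_colmajor, 2,
                      [1.0, 1.0, 1.0], 1, 0.0, [0.0, 0.0], 1)
y_t = dgemv_reference("T", 2, 3, 1.0, a_colmajor, 2,
                      [1.0, 1.0], 1, 0.0, [0.0, 0.0, 0.0], 1)
print(y_n, y_t)
```

With `trans = "N"` the result has length `m`; with `trans = "T"` it has length `n`, which is worth keeping in mind when sizing the output array in the Cython code.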
...and it is not deterministic either. About one run in six, the code calling the scipy blas gives a completely wrong result. How can this be?
Partially fixed.

I was mixing up the row and column order. For some reason this worked in some cases. Now I've fixed it and it *always* works.

However, it is still slower than the cblas:

cblas -> 0.69 sec
scipy blas -> 0.74 sec

Any clue why?
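As an illustration of how a row/column mix-up can arise with a Fortran-convention routine, the sketch below shows what a column-major reader sees in a buffer that was filled in C (row-major) order. The helper names are hypothetical and the example does not claim to reproduce the exact mistake in the original code; it only shows the general layout issue.

```python
def flatten_row_major(rows):
    """Flatten a list of rows the way a C-contiguous array is laid out."""
    return [value for row in rows for value in row]


def read_column_major(buf, m, n, lda):
    """Return the m x n matrix a Fortran routine sees in buf (leading dim lda)."""
    return [[buf[i + j * lda] for j in range(n)] for i in range(m)]


c_matrix = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]        # 2 rows, 3 columns, C order
buf = flatten_row_major(c_matrix)   # [1, 2, 3, 4, 5, 6]

# Declaring m=3, n=2, lda=3 makes Fortran see the transpose of c_matrix,
# so a product with c_matrix itself needs trans='T' with these dimensions.
as_transpose = read_column_major(buf, 3, 2, 3)

# Declaring m=2, n=3, lda=2 stays inside the buffer but scrambles the entries.
scrambled = read_column_major(buf, 2, 3, 2)

print(as_transpose)
print(scrambled)
```

Under this reading, a C-ordered matrix with `rows` rows and `cols` columns can be passed as `m = cols`, `n = rows`, `lda = cols` together with `trans = 'T'`. For a symmetric square matrix both readings coincide, which is one possible reason a wrong ordering can appear to work in some cases.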
On Sat, 23 Feb 2013 18:31:42 +0000 (UTC) Sergio Callegari <sergio.callegari@gmail.com> wrote:
> However, it is still slower than the cblas
>
> cblas -> 0.69 sec
> scipy blas -> 0.74 sec
If you are using the scipy blas, the real question is which BLAS is underneath: OpenBLAS, GotoBLAS, ATLAS or MKL?

Under Debian I observed a 17x speed-up, from 35 s to 2 s, on Armando's code after an "apt-get install atlas".

Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
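One way to start answering the "which BLAS is underneath" question from Python is to look at the build configuration that scipy reports. The sketch below captures the output of `scipy.show_config()` as text; `blas_build_info` is a hypothetical helper name, and the function returns `None` when scipy is not installed. Note that this reports what scipy was built against, which may differ from the library the dynamic linker actually resolves at run time on systems where the BLAS can be swapped.

```python
import contextlib
import io


def blas_build_info():
    """Return scipy's build configuration text, or None if scipy is missing."""
    try:
        import scipy
    except ImportError:
        return None
    captured = io.StringIO()
    # show_config() prints its report, so capture stdout to get it as a string.
    with contextlib.redirect_stdout(captured):
        scipy.show_config()
    return captured.getvalue()


info = blas_build_info()
print(info if info is not None else "scipy is not installed")
```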
23.02.2013 20:31, Sergio Callegari wrote:
> Partially fixed.
>
> I was messing the row, column order. For some reason this was working in some case. Now I've fixed it and it *always* works.
>
> However, it is still slower than the cblas
>
> cblas -> 0.69 sec
> scipy blas -> 0.74 sec
The possible explanations are that either the routine called is different in the two cases, or the benchmark is somehow faulty.

--
Pauli Virtanen
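To check whether a gap as small as 0.69 s versus 0.74 s is real rather than a benchmarking artifact, one common approach is to repeat each measurement several times and compare the best runs. The sketch below uses the standard `timeit` module; `variant_a` and `variant_b` are hypothetical stand-ins for the two code paths, not the actual blas calls from this thread.

```python
import timeit


def best_of(func, repeat=5, number=1000):
    """Return the minimum total time over `repeat` runs of `number` calls each."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


# Hypothetical stand-ins for the two implementations being compared.
def variant_a():
    return sum(i * i for i in range(100))


def variant_b():
    return sum([i * i for i in range(100)])


timings = {name: best_of(fn) for name, fn in (("a", variant_a), ("b", variant_b))}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over repeats reduces the influence of other processes on the machine; if the ordering of the two variants changes between runs, the difference is probably within the noise.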
Participants (3): Jerome Kieffer, Pauli Virtanen, Sergio Callegari