OK, I've got it working. I was mixing up the row/column order mapping and the transpose parameter. However, it is still slower than cblas. After replacing all the cblas_xxx calls in my code (about 8 of them), I get the following timings:

cblas version: 0.69 sec
scipy blas: 0.74 sec

Any clue why?
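
For anyone else who trips over the row/column mapping mentioned above, here is a minimal sketch of the usual trick. It is not my actual code: it assumes row-major (C-ordered) float64 arrays and uses the Python-level scipy.linalg.blas.dgemm wrapper purely for illustration, and the helper name rowmajor_matmul is made up. The idea is that a row-major matrix read as column-major is its transpose, so you ask the Fortran-order routine for (A B)^T = B^T A^T by swapping the operands.

```python
import numpy as np
from scipy.linalg.blas import dgemm


def rowmajor_matmul(a, b):
    """Return a @ b for row-major float64 arrays using Fortran-order dgemm.

    Hypothetical helper for illustration only.
    """
    # a.T and b.T are Fortran-ordered views of the row-major inputs,
    # so they can be handed to the column-major routine as-is.
    # Swapping the operands gives b.T @ a.T, which equals (a @ b).T.
    ct = dgemm(1.0, b.T, a.T)
    # Transposing the column-major result yields a row-major a @ b.
    return ct.T


rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((4, 2))
c = rowmajor_matmul(a, b)
```

The same swap applies when calling the Fortran-style routine from compiled code: exchange the two operands (and their dimensions) instead of setting the transpose flags, and the output buffer ends up in row-major layout.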