On 11.04.2014 19:05, Sturla Molden wrote:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can reach better than 75% of their speed without any assembly or dark magic, using just plain C99 compiled with Intel icc, I think that would be sufficient for binary wheels on Windows.
Hi,

if you can, also give gcc with Graphite a try. Its loop transformations should give you results similar to manual blocking, provided the compiler is able to understand the loop. See http://gcc.gnu.org/gcc-4.4/changes.html for the relevant flags:

  -floop-strip-mine
  -floop-block
  -floop-interchange

plus a couple of options to tune the parameters.

You may need gcc-4.8 for it to work properly on loops whose iteration counts are not fixed at compile time.

As far as I know, clang/LLVM has a comparable polyhedral loop optimizer as well (Polly).

Cheers,
Julian
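
For illustration, here is a minimal C99 sketch of the two things being compared: a plain triple loop of the kind Graphite would have to recognize, and a manually blocked version of the same computation. The function names, the square row-major layout and the tile size of 32 are assumptions made for this example only; this is not cblas_dgemm and it is not tuned. Whether gcc actually applies the transformations depends on the gcc version and on it being built with Graphite support.

```c
/* Possible build line, using the flags named in the message
 * (assumes a gcc built with Graphite support):
 *   gcc -std=c99 -O2 -floop-strip-mine -floop-block -floop-interchange gemm.c
 */
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C += A * B for row-major n x n matrices.
 * This is the plain loop nest a loop optimizer would need to understand. */
void naive_gemm(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

enum { BLOCK = 32 }; /* illustrative tile size, an assumption, not tuned */

/* Hypothetical helper: the same update with manual blocking (tiling).
 * Each tile of A, B and C is reused while it is still likely in cache. */
void blocked_gemm(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* innermost loop walks rows of B and C contiguously */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}
```

Both functions accumulate into C, so the caller has to zero C first to get a plain product. Benchmarking naive_gemm built with and without the Graphite flags against blocked_gemm would show how close the compiler gets to the hand-written tiling.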