[Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

Fri Apr 11 12:53:30 EDT 2014

Nathaniel Smith <njs at pobox.com> wrote:

> I unfortunately don't have the skills to actually lead such an effort
> (I've never written a line of asm in my life...), but surely our
> collective communities have people who do?

The assembly part in OpenBLAS/GotoBLAS is the major problem. Not just that
it's AT&T syntax (i.e. it requires MinGW to build on Windows), but also
that it supports a wide range of processors. We just need a fast BLAS we
can use on Windows binary wheels (and possibly Mac OS X). There is no need
to support anything other than x86 and AMD64 architectures. So in theory one
could throw out all assembly and rewrite the kernels with compiler
intrinsics for various SIMD architectures. Or one could just rely on the
compiler to autovectorize, as long as the code is written so that it is
easily vectorized. If we manually unroll loops properly, and make sure the
compiler is hinted about memory alignment and pointer aliasing, the compiler
will know what to do.
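
To make the autovectorization idea concrete, here is a minimal sketch of a
DAXPY-style kernel (y := alpha*x + y) written the way described above:
manually unrolled by four, with the pointers marked as non-aliasing and the
data declared 32-byte aligned. The name sketch_daxpy_aligned and the macros
are hypothetical, not taken from OpenBLAS or Netlib. The 32-byte alignment
and unit stride are assumptions made for the illustration, and whether a
given compiler actually emits SIMD code for it depends on the compiler and
its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portability shims (assumptions for this sketch): MSVC spells the
 * no-alias qualifier __restrict and has no __builtin_assume_aligned. */
#if defined(_MSC_VER)
#  define SKETCH_RESTRICT __restrict
#else
#  define SKETCH_RESTRICT restrict
#endif

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ASSUME_ALIGNED(p, n) __builtin_assume_aligned((p), (n))
#else
#  define SKETCH_ASSUME_ALIGNED(p, n) (p)
#endif

/* Hypothetical kernel: y[i] += alpha * x[i] for unit-stride vectors.
 * The caller promises that x and y do not overlap and that both are
 * 32-byte aligned; the asserts check the alignment in debug builds. */
void sketch_daxpy_aligned(size_t n, double alpha,
                          const double *SKETCH_RESTRICT x,
                          double *SKETCH_RESTRICT y)
{
    assert(((uintptr_t)x % 32) == 0);
    assert(((uintptr_t)y % 32) == 0);

    /* Alignment hint for the compiler. */
    const double *xa = SKETCH_ASSUME_ALIGNED(x, 32);
    double *ya = SKETCH_ASSUME_ALIGNED(y, 32);

    size_t i = 0;
    const size_t n4 = n - (n % 4);

    /* Main loop, manually unrolled by four (one 256-bit vector of doubles). */
    for (; i < n4; i += 4) {
        ya[i]     += alpha * xa[i];
        ya[i + 1] += alpha * xa[i + 1];
        ya[i + 2] += alpha * xa[i + 2];
        ya[i + 3] += alpha * xa[i + 3];
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        ya[i] += alpha * xa[i];
    }
}
```

The caller is responsible for allocating x and y with the promised
alignment, for example with C11 _Alignas(32) or an aligned allocator.
Calling the kernel with overlapping or misaligned buffers is undefined
behaviour, which is the price of giving the compiler these hints.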

There is already a reference BLAS implementation at Netlib, which we could
translate to C and optimize for SIMD. Then we need a fast threadpool. I
have one I can donate, or we could use libxdispatch (a port of Apple's
libdispatch, aka GCD, to Windows and Linux). Even Intel could not make their
TBB perform better than libdispatch. And that's about what we need. Or we
could start with OpenBLAS and throw away everything we don't need. 

Making a totally new BLAS might seem like a crazy idea, but it might be the
best solution in the long run. 

Sturla