Hi,

On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
On 11 Apr 2014 at 19:05, Sturla Molden <sturla.molden@gmail.com> wrote:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage of their performance have you achieved so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no one has gotten around to sitting down and implementing it. (In particular, that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't yet bothered to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries, but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer dispatch instead would not be that hard, I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
Not much point in worrying about this I think until someone tries a proof of concept. But potentially even the labs working on BLIS would be interested in a small grant from NumFOCUS or something.
The problem is that the time and mental energy involved in the proof-of-concept may be enough to prevent it being done, and having some money to pay for time and to placate employers may be useful in overcoming that. To be clear - not me - I will certainly help if I can, but being paid isn't going to help me work on this.

Cheers,

Matthew