Hi,

On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
On 11 Apr 2014 at 19:05, Sturla Molden <sturla.molden@gmail.com> wrote:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage of their performance have you achieved so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no one has gotten around to sitting down and implementing it. (In particular, that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't yet bothered to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries, but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer dispatch instead would not be that hard, I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
Not much point in worrying about this I think until someone tries a proof of concept. But potentially even the labs working on BLIS would be interested in a small grant from NumFOCUS or something.
The problem is that the time and mental energy involved in the proof-of-concept may be enough to prevent it being done, and having some money to pay for time and to placate employers may be useful in overcoming that. To be clear - not me - I will certainly help if I can, but being paid isn't going to help me work on this.

Cheers,

Matthew