On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn wrote:
On 11 Apr 2014 at 19:05, Sturla Molden wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage of their performance have you achieved so far?
I finally read this paper:
  http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it.

And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries but irrelevant to reaching minimal-viable-product stage.)

It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think.

A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
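
P.S. To make the "function pointer thing" a bit more concrete, here is a rough sketch of what runtime kernel selection could look like. This is not BLIS code: every name in it (my_dgemm, dgemm_kernel_fn, the two kernels) is made up for illustration, the "AVX2" kernel is only a stand-in that forwards to the generic loop, and the detection assumes GCC or Clang on x86, where the __builtin_cpu_supports builtin is available. On anything else it simply falls back to the generic kernel.

```c
#include <stddef.h>

/* Hypothetical kernel signature: C[m x n] += A[m x k] * B[k x n], row-major. */
typedef void (*dgemm_kernel_fn)(size_t m, size_t n, size_t k,
                                const double *a, const double *b, double *c);

/* Portable fallback kernel: a plain triple loop, no tuning at all. */
static void dgemm_kernel_generic(size_t m, size_t n, size_t k,
                                 const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p) {
            double aip = a[i * k + p];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aip * b[p * n + j];
        }
}

/* Assumption: the CPU feature builtins exist only with GCC/Clang on x86. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_CPU_BUILTINS 1

/* Stand-in for a tuned kernel. A real one would use AVX2 intrinsics or
 * assembly; this one forwards to the generic loop so the sketch runs anywhere. */
static void dgemm_kernel_avx2(size_t m, size_t n, size_t k,
                              const double *a, const double *b, double *c)
{
    dgemm_kernel_generic(m, n, k, a, b, c);
}
#endif

/* Pick a kernel at run time instead of via the C preprocessor. */
static dgemm_kernel_fn select_dgemm_kernel(void)
{
#ifdef HAVE_X86_CPU_BUILTINS
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return dgemm_kernel_avx2;
#endif
    return dgemm_kernel_generic;
}

/* Resolved on first use. Note: this lazy initialization is not thread-safe. */
static dgemm_kernel_fn dgemm_kernel = NULL;

/* Public entry point: dispatches through the function pointer. */
void my_dgemm(size_t m, size_t n, size_t k,
              const double *a, const double *b, double *c)
{
    if (dgemm_kernel == NULL)
        dgemm_kernel = select_dgemm_kernel();
    dgemm_kernel(m, n, k, a, b, c);
}
```

Callers only ever see my_dgemm; which kernel sits behind the pointer is decided once, on the machine the code actually runs on, so a single binary could carry several kernels. A real library would presumably resolve the pointer at load time or behind a lock rather than lazily like this.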