The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On Fri, Apr 11, 2014 at 12:21 PM, Carl Kleffner <cmkleffner@gmail.com> wrote:
a discussion about OpenBLAS on the octave maintainer list: http://article.gmane.org/gmane.comp.gnu.octave.maintainers/38746
I'm getting the impression that OpenBLAS is proving to be both a tantalizing opportunity and a practical thorn-in-the-side for everyone -- Python, Octave, Julia, R. How crazy would it be to get together an organized effort to fix this problem -- "OpenBLAS for everyone"? E.g., by collecting patches to fix the bits we don't like (like unhelpful build system defaults), applying more systematic QA, etc. Ideally this could be done upstream, but if upstream is MIA or disagrees about OpenBLAS's goals, then it could be maintained as a kind of "OpenBLAS++" that merges regularly from upstream (compare [1][2][3] for successful projects handled in this way). If hardware for testing is a problem, then I suspect NumFOCUS would be overjoyed to throw a few kilodollars at buying one instance of each widely distributed microarchitecture released in the last few years as a test farm...

I think the goal is pretty clear: a modern, optionally multithreaded BLAS under a BSD-like license, with a priority on correctness, out-of-the-box functionality (like runtime configurability and feature detection), speed, and portability, in that order.

I unfortunately don't have the skills to actually lead such an effort (I've never written a line of asm in my life...), but surely our collective communities have people who do?

-n

[1] http://www.openssh.com/portable.html
[2] http://www.eglibc.org/mission (a "friendly fork" of glibc holding changes that Ulrich Drepper got cranky about, which were eventually merged back)
[3] https://en.wikipedia.org/wiki/Go-oo

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Nathaniel Smith <njs@pobox.com> wrote:
I unfortunately don't have the skills to actually lead such an effort (I've never written a line of asm in my life...), but surely our collective communities have people who do?
The assembly part in OpenBLAS/GotoBLAS is the major problem. Not just that it's AT&T syntax (i.e. it requires MinGW to build on Windows), but also that it supports a wide range of processors. We just need a fast BLAS we can use in Windows binary wheels (and possibly Mac OS X). There is no need to support anything other than x86 and AMD64 architectures.

So in theory one could throw out all the assembly and rewrite the kernels with compiler intrinsics for the various SIMD architectures. Or we could just rely on the compiler to autovectorize, and simply write the code so it is easily vectorized. If we manually unroll loops properly, and make sure the compiler is hinted about memory alignment and pointer aliasing, the compiler will know what to do. There is already a reference BLAS implementation at Netlib, which we could translate to C and optimize for SIMD.

Then we need a fast threadpool. I have one I can donate, or we could use libxdispatch (a port of Apple's libdispatch, aka GCD, to Windows and Linux). Even Intel could not make their TBB perform better than libdispatch. And that's about all we need. Or we could start with OpenBLAS and throw away everything we don't need.

Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.

Sturla
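(To make the "hint the compiler about alignment and aliasing" point concrete, here is a minimal sketch -- illustrative names only, using the GCC/Clang __builtin_assume_aligned builtin -- of the kind of C that autovectorizes well:)

/* Sketch of a vectorization-friendly AXPY-style update: restrict tells the
 * compiler the arrays don't alias, the builtin promises 32-byte alignment,
 * and the manual 4-way unroll assumes n is a multiple of 4. */
#include <stddef.h>

void daxpy_kernel(size_t n, double alpha,
                  const double *restrict x, double *restrict y)
{
    const double *xa = __builtin_assume_aligned(x, 32);
    double *ya = __builtin_assume_aligned(y, 32);
    for (size_t i = 0; i < n; i += 4) {
        ya[i + 0] += alpha * xa[i + 0];
        ya[i + 1] += alpha * xa[i + 1];
        ya[i + 2] += alpha * xa[i + 2];
        ya[i + 3] += alpha * xa[i + 3];
    }
}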
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark it against MKL, Accelerate and OpenBLAS. If I can get the performance above 75% of their speed, without any assembly or dark magic -- just plain C99 compiled with Intel icc -- that would, I think, be sufficient for binary wheels on Windows.

Sturla
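(For what it's worth, a rough sketch of how such a comparison might be timed -- a standard CBLAS call, a POSIX clock, and GFLOP/s from the 2*n^3 flop count; the size is arbitrary and the timing code is not Windows-portable as written:)

/* Time one square dgemm and report GFLOP/s; link against the BLAS under test
 * (MKL, Accelerate, OpenBLAS, or a home-grown cblas_dgemm). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 2000;
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("dgemm %d x %d: %.3f s, %.2f GFLOP/s\n",
           n, n, secs, 2.0 * n * n * (double)n / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}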
On 11.04.2014 19:05, Sturla Molden wrote:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark magic, just plain C99 compiled with Intel icc, that would be sufficient for binary wheels on Windows I think.
Hi, if you can, also give gcc with Graphite a try. Its loop transformations should give you results similar to manual blocking if the compiler is able to understand the loop, see http://gcc.gnu.org/gcc-4.4/changes.html

-floop-strip-mine -floop-block -floop-interchange

plus a couple of options to tune the parameters. You may need gcc 4.8 for it to work properly on loops whose iteration counts are not fixed at compile time. As far as I know, clang/llvm also has Graphite integration.

Cheers, Julian
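(For illustration, this is the kind of plain loop nest those passes target, with the flag set above shown in a comment -- exact tuning options vary by GCC version:)

/* Textbook dgemm loop nest, written plainly so the polyhedral passes can
 * analyze and block it.  Compile with something like:
 *   gcc -O3 -std=c99 -floop-strip-mine -floop-block -floop-interchange gemm.c
 */
void gemm_naive(int m, int n, int k,
                const double *restrict A, const double *restrict B,
                double *restrict C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}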
On Fri, Apr 11, 2014 at 6:05 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark magic, just plain C99 compiled with Intel icc, that would be sufficient for binary wheels on Windows I think.
Sounds like a worthwhile experiment! My suspicion is that we'll be better off starting with something that is almost good enough (OpenBLAS) and then incrementally improving it to meet our needs, rather than starting from scratch -- there's a *long* way to get from dgemm to a fully supported BLAS project -- but no matter what it'll generate useful data, and possibly some useful code that could either be the basis of something new or be integrated into whatever we do end up doing.

Also, while Windows is maybe in the worst shape, all platforms would seriously benefit from the existence of a reliable speed-competitive binary-distribution-compatible BLAS that doesn't break fork().

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
On 11/04/14 20:47, Nathaniel Smith wrote:
Also, while Windows is maybe in the worst shape, all platforms would seriously benefit from the existence of a reliable speed-competitive binary-distribution-compatible BLAS that doesn't break fork().
Windows is worst off, yes.

I don't think fork breakage by Accelerate is a big problem on Mac OS X. Apple has made it clear that only POSIX APIs are fork safe. And this is now actually recognized as an error in multiprocessing and fixed in Python 3.4:

multiprocessing.set_start_method('spawn')

On Linux the distributions will usually ship a prebuilt ATLAS.

Sturla
On Fri, Apr 11, 2014 at 11:26 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 11/04/14 20:47, Nathaniel Smith wrote:
Also, while Windows is maybe in the worst shape, all platforms would seriously benefit from the existence of a reliable speed-competitive binary-distribution-compatible BLAS that doesn't break fork().
Windows is worst off, yes.
I don't think fork breakage by Accelerate is a big problem on Mac OS X. Apple has made clear that only POSIX APIs are fork safe. And actually this is now recognized as an error in multiprocessing and fixed in Python 3.4:
multiprocessing.set_start_method('spawn')
I don't really care whether it's *documented* that BLAS and fork are incompatible. I care whether it *works*, because it is useful functionality :-). The spawn mode is fine and all, but (a) the presence of something in 3.4 helps only a minority of users, (b) "spawn" is not a full replacement for fork; with large read-mostly data sets it can be a *huge* win to load them into the parent process and then let them be COW-inherited by forked children. ATM the only other way to work with a data set that's larger than memory-divided-by-numcpus is to explicitly set up shared memory, and this is *really* hard for anything more complicated than a single flat array.
On Linux the distributions will usually ship with prebuilt ATLAS.
And it's generally recommended that everyone rebuild their own ATLAS anyway. I can do it, but I'd much rather be able to install a BLAS library that just worked. (Presumably this is a large part of why scipy-stack distributors prefer MKL over ATLAS.) If it comes down to it then of course I'd rather have a Windows-only BLAS than no BLAS at all. I just don't think we should be setting our sights so low at this point. The marginal cost of portability doesn't seem high. Besides, even Windows users will benefit more from having a standard cross-platform BLAS that everyone uses -- it would mean lots more people familiar with the library's quirks, better testing, etc. -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
On 12/04/14 00:39, Nathaniel Smith wrote:
The spawn mode is fine and all, but (a) the presence of something in 3.4 helps only a minority of users, (b) "spawn" is not a full replacement for fork;
It basically does the same as on Windows. If you want portability to Windows, you must abide by these restrictions anyway.
with large read-mostly data sets it can be a *huge* win to load them into the parent process and then let them be COW-inherited by forked children.
The thing is that Python reference counting breaks COW fork. This has been discussed several times on the python-dev list. What happens is that as soon as the child process updates a refcount, the OS copies the page. And because of how Python behaves, this copying of COW-marked pages quickly gets excessive. Effectively, the performance of os.fork in Python will be close to that of a non-COW fork. A suggested solution is to move the refcounts out of the PyObject struct and keep them in a dedicated heap, but doing so would be unfriendly to the cache.
ATM the only other way to work with a data set that's larger than memory-divided-by-numcpus is to explicitly set up shared memory, and this is *really* hard for anything more complicated than a single flat array.
Not difficult. You just go to my GitHub site and grab the code ;) (I have some problems running it on my MBP though, not sure why, but it used to work on Linux and Windows, and possibly still does.) https://github.com/sturlamolden/sharedmem-numpy Sturla
On Sat, Apr 12, 2014 at 12:07 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 12/04/14 00:39, Nathaniel Smith wrote:
The spawn mode is fine and all, but (a) the presence of something in 3.4 helps only a minority of users, (b) "spawn" is not a full replacement for fork;
It basically does the same as on Windows. If you want portability to Windows, you must abide by these restrictions anyway.
Yes, but "sorry Unix guys, we've decided to take away this nice feature from you because it doesn't work on Windows" is a really terrible argument. If it can't be made to work, then fine, but fork safety is just not *that* much to ask.
with large read-mostly data sets it can be a *huge* win to load them into the parent process and then let them be COW-inherited by forked children.
The thing is that Python reference counting breaks COW fork. This has been discussed several times on the python-dev list. What happens is that as soon as the child process updates a refcount, the OS copies the page. And because of how Python behaves, this copying of COW-marked pages quickly gets excessive. Effectively, the performance of os.fork in Python will be close to that of a non-COW fork. A suggested solution is to move the refcounts out of the PyObject struct and keep them in a dedicated heap, but doing so would be unfriendly to the cache.
Yes, it's limited, but again this is not a reason to break it in the cases where it *does* work. The case where I ran into this was loading a big language model using SRILM: http://www.speech.sri.com/projects/srilm/ https://github.com/njsmith/pysrilm This produces a single Python object that references an opaque, tens-of-gigabytes mess of C++ objects. For this case explicit shared mem is useless, but fork worked brilliantly. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
On 12/04/14 01:07, Sturla Molden wrote:
ATM the only other way to work with a data set that's larger than memory-divided-by-numcpus is to explicitly set up shared memory, and this is *really* hard for anything more complicated than a single flat array.
Not difficult. You just go to my GitHub site and grab the code ;)
(I have some problems running it on my MBP though, not sure why, but it used to work on Linux and Windows, and possibly still does.)
Hmm, today it works fine on my MBP too... Good. :)

import multiprocessing as mp
import numpy as np
import sharedmem as shm

def proc(qin, qout):
    print("grabbing array from queue")
    a = qin.get()
    print(a)
    print("putting array in queue")
    b = shm.zeros(10)
    print(b)
    qout.put(b)
    print("waiting for array to be updated by another process")
    a = qin.get()
    print(b)

if __name__ == "__main__":
    qin = mp.Queue()
    qout = mp.Queue()
    p = mp.Process(target=proc, args=(qin, qout))
    p.start()
    a = shm.zeros(4)
    qin.put(a)
    b = qout.get()
    b[:] = range(10)
    qin.put(None)
    p.join()

sturla$ python example.py
grabbing array from queue
[ 0.  0.  0.  0.]
putting array in queue
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
waiting for array to be updated by another process
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]

Sturla
Hi, On Fri, Apr 11, 2014 at 10:05 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark magic, just plain C99 compiled with Intel icc, that would be sufficient for binary wheels on Windows I think.
Did you check out the Intel license though?

http://software.intel.com/sites/default/files/managed/95/23/Intel_SW_Dev_Pro...

D. DISTRIBUTION: Distribution of the Redistributables is also subject to the following limitations: You (i) shall be solely responsible to your customers for any update or support obligation or other liability which may arise from the distribution, (ii) shall not make any statement that your product is "certified", or that its performance is guaranteed, by Intel, (iii) shall not use Intel's name or trademarks to market your product without written permission, (iv) shall use a license agreement that prohibits disassembly and reverse engineering of the Redistributables, (v) shall indemnify, hold harmless, and defend Intel and its suppliers from and against any claims or lawsuits, including attorney's fees, that arise or result from your distribution of any product.

Are you sure that you can redistribute object code statically linked against icc runtimes?

Cheers, Matthew
On Fri, Apr 11, 2014 at 2:58 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 11/04/14 23:11, Matthew Brett wrote:
Are you sure that you can redistribute object code statically linked against icc runtimes?
I am not a lawyer...
No - sure - but it would be frustrating if you found yourself optimizing with a compiler that is useless for subsequent open-source builds. Best, Matthew
On 12/04/14 00:01, Matthew Brett wrote:
No - sure - but it would be frustrating if you found yourself optimizing with a compiler that is useless for subsequent open-source builds.
No, I think MSVC or gcc 4.8/4.9 will work too. It's just that I happen to have icc and clang on this computer :) Sturla
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage of their performance did you achieve so far? Cheers, Michael
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage on performance did you achieve so far?
I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular, that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't yet bothered to do thread pools properly or to systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries, but irrelevant to reaching minimal-viable-product stage.)

It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard, I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
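(A rough sketch of what that "function pointer thing" could look like -- CPU detection via the GCC/Clang __builtin_cpu_supports builtin, choosing a kernel once at library init; the names are made up and this is not BLIS's actual kernel interface:)

/* Hypothetical runtime kernel dispatch.  __builtin_cpu_supports is available
 * in GCC 4.8+ and recent Clang on x86. */
#include <stddef.h>

typedef void (*dgemm_ukernel_t)(size_t k, const double *A, const double *B,
                                double *C, size_t ldc);

void dgemm_ukernel_avx (size_t, const double *, const double *, double *, size_t);
void dgemm_ukernel_sse2(size_t, const double *, const double *, double *, size_t);
void dgemm_ukernel_ref (size_t, const double *, const double *, double *, size_t);

static dgemm_ukernel_t dgemm_ukernel;   /* selected once, then reused */

void blas_init_kernels(void)
{
    if (__builtin_cpu_supports("avx"))
        dgemm_ukernel = dgemm_ukernel_avx;
    else if (__builtin_cpu_supports("sse2"))
        dgemm_ukernel = dgemm_ukernel_sse2;
    else
        dgemm_ukernel = dgemm_ukernel_ref;
}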
On 29/04/14 01:30, Nathaniel Smith wrote:
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point.
I think OpenBLAS in the long run is doomed as an OSS project. Having huge portions of the source in assembly is not sustainable in 2014. OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming abandonware. Sturla
On Tue, Apr 29, 2014 at 12:52 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 29/04/14 01:30, Nathaniel Smith wrote:
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point.
I think OpenBLAS in the long run is doomed as an OSS project. Having huge portions of the source in assembly is not sustainable in 2014. OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming abandonware.
Have you read the paper I linked? I really recommend it. BLIS is apparently 95% straight-up-C, plus a slot where you stick in a tiny CPU-specific super-optimized kernel [1]. So this localizes the nasty stuff to one tiny function, plus most of the kernels that have been written so far do in fact use intrinsics [2]. [1] https://code.google.com/p/blis/wiki/KernelsHowTo [2] https://code.google.com/p/blis/wiki/HardwareSupport -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
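(For flavour, a toy 2x2 double-precision micro-kernel written with SSE2 intrinsics, roughly in the spirit of what the kernels wiki describes -- this is not BLIS's actual kernel interface, and real kernels use much larger register blocks:)

/* C (2x2, column-major, leading dimension ldc) += A-panel * B-panel, where
 * A and B are packed so each iteration reads two contiguous doubles from
 * each; the packed A panel is assumed 16-byte aligned. */
#include <emmintrin.h>

static void dgemm_micro_2x2_sse2(int k, const double *A, const double *B,
                                 double *C, int ldc)
{
    __m128d c0 = _mm_setzero_pd();                 /* column 0 of the block */
    __m128d c1 = _mm_setzero_pd();                 /* column 1 of the block */
    for (int p = 0; p < k; p++) {
        __m128d a  = _mm_load_pd(A + 2 * p);       /* A(0,p), A(1,p) */
        __m128d b0 = _mm_set1_pd(B[2 * p + 0]);    /* broadcast B(p,0) */
        __m128d b1 = _mm_set1_pd(B[2 * p + 1]);    /* broadcast B(p,1) */
        c0 = _mm_add_pd(c0, _mm_mul_pd(a, b0));
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    }
    _mm_storeu_pd(&C[0 * ldc], _mm_add_pd(_mm_loadu_pd(&C[0 * ldc]), c0));
    _mm_storeu_pd(&C[1 * ldc], _mm_add_pd(_mm_loadu_pd(&C[1 * ldc]), c1));
}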
Am 29.04.2014 um 02:01 schrieb Nathaniel Smith <njs@pobox.com>:
On Tue, Apr 29, 2014 at 12:52 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 29/04/14 01:30, Nathaniel Smith wrote:
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point.
I think OpenBLAS in the long run is doomed as an OSS project. Having huge portions of the source in assembly is not sustainable in 2014. OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming abandonware.
Have you read the paper I linked? I really recommend it. BLIS is apparently 95% straight-up-C, plus a slot where you stick in a tiny CPU-specific super-optimized kernel [1]. So this localizes the nasty stuff to one tiny function, plus most of the kernels that have been written so far do in fact use intrinsics [2].
[1] https://code.google.com/p/blis/wiki/KernelsHowTo [2] https://code.google.com/p/blis/wiki/HardwareSupport
This summer I was teaching an undergraduate class „Software Basics on HPC". One topic was, of course, the efficient implementation of the matrix-matrix product GEMM. The BLIS paper [1] is a great source for that. In my opinion, hands-on experience is very important for actually understanding these concepts. That in particular means that we implemented our own matrix-matrix product. The pure C (ANSI C) implementation has less than 450 lines of code. The code consists of several functions, and the students developed these functions one by one from one assignment to the next. You can see the result here:

http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page02/index.html#toc4

Other assignments were about improving the micro kernel with SSE instructions. You can traverse through the pages to see how we did so step by step. Please understand that this course material is still work in progress and needs some polish here and there. Still, it could be useful for others and even a starting point for a simple BLAS implementation.

Cheers, Michael

[1]: http://www.cs.utexas.edu/users/flame/pubs/BLISTOMSrev2.pdf

-----------------------------------------------------------------------------------
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 50-23534, Fax: (+49) 731 50-23548
-----------------------------------------------------------------------------------
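(As a stripped-down illustration of the micro-kernel contract at the heart of such a GEMM -- a small MR x NR block of C updated by a rank-k product of packed panels of A and B -- here is a simplified reference version in the same spirit; it is not the course's actual code:)

/* Reference micro-kernel: C (MR x NR, column-major, leading dim ldc) +=
 * A-panel (MR x k) * B-panel (k x NR), with A and B packed so the p-loop
 * walks both panels contiguously.  Optimized kernels replace the body with
 * SIMD intrinsics or assembly; the blocking loops around it stay in C. */
#define MR 4
#define NR 4

static void dgemm_micro_ref(int k, const double *A, const double *B,
                            double *C, int ldc)
{
    double ab[MR * NR] = {0.0};
    for (int p = 0; p < k; p++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                ab[j * MR + i] += A[p * MR + i] * B[p * NR + j];

    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[j * ldc + i] += ab[j * MR + i];
}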
Hi, On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage on performance did you achieve so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it? Cheers, Matthew
On 29.04.2014 02:05, Matthew Brett wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage on performance did you achieve so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
On scipy-dev an interesting BLIS-related message was posted recently: http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html http://www.cs.utexas.edu/~flame/web/ It seems some of the work of integrating BLIS into a proper BLAS/LAPACK library has already been done.
On Tue, Apr 29, 2014 at 1:10 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 29.04.2014 02:05, Matthew Brett wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
On scipy-dev a interesting BLIS related message was posted recently: http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html http://www.cs.utexas.edu/~flame/web/
It seems some work of integrating BLIS into a proper BLAS/LAPACK library is already done.
BLIS itself ships with a BLAS-compatible interface that you can use with reference LAPACK (just like OpenBLAS). I wouldn't be surprised if there are various annoying Fortran/C ABI hacks remaining to be worked out, but at least in principle BLIS is a BLAS. The problem is that this BLAS has no threading, no runtime configuration (you have to edit a config file and recompile to change CPU support), and no Windows build goop. Basically the authors seem to still be thinking of a BLAS library's target audience as supercomputer sysadmins, not naive end-users.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
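(The sort of ABI detail in question: the BLAS entry points are Fortran-style symbols, so a C caller typically sees dgemm as something like the declaration below. Symbol decoration -- the trailing underscore -- and integer width are exactly the platform-specific bits that tend to need sorting out; this is the common Linux/gfortran convention, not a universal one:)

/* Fortran-ABI view of dgemm from C: everything passed by pointer,
 * column-major storage, trailing-underscore symbol name. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

void square_matmul(int n, const double *a, const double *b, double *c)
{
    const double one = 1.0, zero = 0.0;
    /* C := A * B for n x n column-major matrices */
    dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
}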
BLIS looks interesting. Besides threading and runtime configuration, support for building it as a shared library would also be required for it to be usable by Python packages that have several extension modules linking against a BLAS implementation.

https://code.google.com/p/blis/wiki/FAQ#Can_I_build_BLIS_as_a_shared_library...

"""
Can I build BLIS as a shared library?

The BLIS build system is not yet capable of outputting a shared library. Building and using shared libraries requires careful attention to various linkage and runtime details that, quite frankly, the BLIS developers would rather avoid if possible. If this feature is important to you, please speak up on the blis-devel mailing list.
"""

Also, Windows support is still considered experimental according to the same FAQ.

--
Olivier
Yes, they seem to be focused on HPC clusters, with sometimes old-fashioned rules (such as no shared libraries). Also, they don't use a portable Makefile generator, not even autoconf; this may also play a role in the state of Windows support.

2014-05-12 12:52 GMT+01:00 Olivier Grisel <olivier.grisel@ensta.org>:
BLIS looks interesting. Besides threading and runtime configuration, adding support for building it as a shared library would also be required to be usable by python packages that have several extension modules that link against a BLAS implementation.
https://code.google.com/p/blis/wiki/FAQ#Can_I_build_BLIS_as_a_shared_library...
""" Can I build BLIS as a shared library?
The BLIS build system is not yet capable of outputting a shared library. Building and using shared libraries requires careful attention to various linkage and runtime details that, quite frankly, the BLIS developers would rather avoid if possible. If this feature is important to you, please speak up on the blis-devel mailing list. """
Also Windows support is still considered experimental according to the same FAQ.
-- Olivier
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
Neither the numpy ATLAS build nor the MKL build on Windows makes use of shared libs, the latter due to licence restrictions.

Carl

2014-05-12 14:23 GMT+02:00 Matthieu Brucher <matthieu.brucher@gmail.com>:
Yes, they seem to be focused on HPC clusters with sometimes old rules (as no shared library). Also, they don't use a potable Makefile generator, not even autoconf, this may also play a role in Windows support.
2014-05-12 12:52 GMT+01:00 Olivier Grisel <olivier.grisel@ensta.org>:
BLIS looks interesting. Besides threading and runtime configuration, adding support for building it as a shared library would also be required to be usable by python packages that have several extension modules that link against a BLAS implementation.
https://code.google.com/p/blis/wiki/FAQ#Can_I_build_BLIS_as_a_shared_library ?
""" Can I build BLIS as a shared library?
The BLIS build system is not yet capable of outputting a shared library. Building and using shared libraries requires careful attention to various linkage and runtime details that, quite frankly, the BLIS developers would rather avoid if possible. If this feature is important to you, please speak up on the blis-devel mailing list. """
Also Windows support is still considered experimental according to the
same FAQ.
-- Olivier
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
There is the issue of installing the shared library at the proper location as well IIRC? 2014-05-12 13:54 GMT+01:00 Carl Kleffner <cmkleffner@gmail.com>:
Neither the numpy ATLAS build nor the MKL build on Windows makes use of shared libs. The latter due due licence restriction.
Carl
2014-05-12 14:23 GMT+02:00 Matthieu Brucher <matthieu.brucher@gmail.com>:
Yes, they seem to be focused on HPC clusters with sometimes old rules (as no shared library). Also, they don't use a potable Makefile generator, not even autoconf, this may also play a role in Windows support.
2014-05-12 12:52 GMT+01:00 Olivier Grisel <olivier.grisel@ensta.org>:
BLIS looks interesting. Besides threading and runtime configuration, adding support for building it as a shared library would also be required to be usable by python packages that have several extension modules that link against a BLAS implementation.
https://code.google.com/p/blis/wiki/FAQ#Can_I_build_BLIS_as_a_shared_library...
""" Can I build BLIS as a shared library?
The BLIS build system is not yet capable of outputting a shared library. Building and using shared libraries requires careful attention to various linkage and runtime details that, quite frankly, the BLIS developers would rather avoid if possible. If this feature is important to you, please speak up on the blis-devel mailing list. """
Also Windows support is still considered experimental according to the same FAQ.
-- Olivier
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
Hi, On Mon, May 12, 2014 at 6:01 AM, Matthieu Brucher <matthieu.brucher@gmail.com> wrote:
There is the issue of installing the shared library at the proper location as well IIRC?
As Carl implies, the standard numpy installers link statically against the BLAS lib, so we haven't (as far as I know) got a proper location for a shared library.

Maybe it could be part of the API though -- like "np.get_include()", but "np.get_blas_lib()"? Where this could often be None.

Cheers, Matthew
2014-05-12 19:25 GMT+02:00 Matthew Brett <matthew.brett@gmail.com>:
Hi,
On Mon, May 12, 2014 at 6:01 AM, Matthieu Brucher <matthieu.brucher@gmail.com> wrote:
There is the issue of installing the shared library at the proper location as well IIRC?
As Carl implies, the standard numpy installers do static linking to the BLAS lib, so we haven't (as far as I know) got a proper location for the shared library.
Maybe it could be part of the API though, like "np.get_include()" but numpy "np.get_blas_lib()"? Where this can often be None.
The proper location would be in numpy/core/, since _dotblas.pyd is the first occurrence of a BLAS-dependent extension during numpy import. Otherwise some kind of preloading is necessary.

Carl
Cheers,
Matthew
Hi, On Mon, Apr 28, 2014 at 5:10 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 29.04.2014 02:05, Matthew Brett wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage on performance did you achieve so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
On scipy-dev a interesting BLIS related message was posted recently: http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html http://www.cs.utexas.edu/~flame/web/
It seems some work of integrating BLIS into a proper BLAS/LAPACK library is already done.
Has anyone tried building scipy with libflame yet? Cheers, Matthew
On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage on performance did you achieve so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
Not much point in worrying about this I think until someone tries a proof of concept. But potentially even the labs working on BLIS would be interested in a small grant from NumFOCUS or something. -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Hi, On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.lehn@uni-ulm.de> wrote:
Am 11 Apr 2014 um 19:05 schrieb Sturla Molden <sturla.molden@gmail.com>:
Sturla Molden <sturla.molden@gmail.com> wrote:
Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run.
To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark
So what percentage on performance did you achieve so far?
I finally read this paper:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And, I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it. (In particular that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is "best". Which is important if your goal is to write papers about BLAS libraries but irrelevant to reaching minimal-viable-product stage.)
It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.
I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?
Not much point in worrying about this I think until someone tries a proof of concept. But potentially even the labs working on BLIS would be interested in a small grant from NumFOCUS or something.
The problem is the time and mental energy involved in the proof-of-concept may be enough to prevent it being done, and having some money to pay for time and to placate employers may be useful in overcoming that. To be clear - not me - I will certainly help if I can, but being paid isn't going to help me work on this. Cheers, Matthew
On 11.04.2014 18:03, Nathaniel Smith wrote:
On Fri, Apr 11, 2014 at 12:21 PM, Carl Kleffner <cmkleffner@gmail.com> wrote:
a discussion about OpenBLAS on the octave maintainer list: http://article.gmane.org/gmane.comp.gnu.octave.maintainers/38746
I'm getting the impression that OpenBLAS is being both a tantalizing opportunity and a practical thorn-in-the-side for everyone -- Python, Octave, Julia, R.
How crazy would it be to get together an organized effort to fix this problem -- "OpenBLAS for everyone"? E.g., by collecting patches to fix the bits we don't like (like unhelpful build system defaults), applying more systematic QA, etc. Ideally this could be done upstream, but if upstream is MIA or disagrees about OpenBLAS's goals, then it could be maintained as a kind of "OpenBLAS++" that merges regularly from upstream (compare to [1][2][3] for successful projects handled in this way). If hardware for testing is a problem, then I suspect NumFOCUS would be overjoyed to throw a few kilodollars at buying one instance of each widely-distributed microarchitecture released in the last few years as a test farm...
x86 CPUs are backward compatible with almost all instructions they ever introduced, so one machine supporting the latest instruction set is sufficient to test almost everything. For that, the runtime kernel selection must be tuneable via the environment, so you can use kernels intended for older CPUs.

The larger issue is finding a good and thorough testsuite that wasn't written 30 years ago and thus covers problem sizes larger than a few megabytes. These are the problem sizes that have often crashed OpenBLAS in the past. Isn't there a kind of comprehensive BLAS verification testsuite available somewhere, which all BLAS implementations should test against and contribute to? E.g. like the POSIX compliance testsuite.
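(Something along these lines is what "tuneable via the environment" could look like inside the dispatch code -- the variable name here is invented for illustration, not an existing OpenBLAS knob:)

/* Let an environment variable force a particular kernel set, so a new
 * machine can exercise code paths meant for older CPUs. */
#include <stdlib.h>
#include <string.h>

enum kernel_arch { ARCH_AUTO, ARCH_GENERIC, ARCH_SSE2, ARCH_AVX };

static enum kernel_arch pick_arch(void)
{
    const char *force = getenv("MYBLAS_FORCE_ARCH");
    if (force != NULL) {
        if (strcmp(force, "generic") == 0) return ARCH_GENERIC;
        if (strcmp(force, "sse2") == 0)    return ARCH_SSE2;
        if (strcmp(force, "avx") == 0)     return ARCH_AVX;
    }
    return ARCH_AUTO;   /* fall back to cpuid-based detection */
}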
On Fri, Apr 11, 2014 at 7:29 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
x86 cpus are backward compatible with almost all instructions they ever introduced, so one machine with the latest instruction set supported is sufficient to test almost everything. For that the runtime kernel selection must be tuneable via the environment so you can use kernels intended for older cpus.
Overriding runtime kernel selection sounds like a good bite-sized feature that could be added to OpenBLAS...
The larger issue is finding a good and thorough testsuite that wasn't written 30 years ago and thus does covers problem sizes larger than a few megabytes. These are the problem sizes are that often crashed openblas in the past. Isn't there a kind of comprehensive BLAS verification testsuite which all BLAS implementations should test against and contribute to available somewhere? E.g. like the POSIX compliance testsuite.
I doubt it! Someone could make a good start on one in an afternoon though. (Only a start, but half a test suite is a heck of a lot better than nothing.)

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
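(For example, one afternoon-sized brick might look like this -- check cblas_dgemm against a naive triple loop on a largish, deliberately odd-sized problem and fail if the difference is out of line; the sizes and tolerance are arbitrary starting points:)

/* First brick of a BLAS verification suite: compare the library's dgemm
 * against a slow reference on a multi-megabyte, non-square problem. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int m = 1537, n = 1023, k = 515;   /* odd sizes stress edge cases */
    double *a = malloc(sizeof(double) * m * k);
    double *b = malloc(sizeof(double) * k * n);
    double *c = calloc((size_t)m * n, sizeof(double));
    double *ref = calloc((size_t)m * n, sizeof(double));

    srand(42);
    for (int i = 0; i < m * k; i++) a[i] = rand() / (double)RAND_MAX - 0.5;
    for (int i = 0; i < k * n; i++) b[i] = rand() / (double)RAND_MAX - 0.5;

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a, k, b, n, 0.0, c, n);

    for (int i = 0; i < m; i++)              /* slow reference result */
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                ref[i * n + j] += a[i * k + p] * b[p * n + j];

    double maxdiff = 0.0;
    for (int i = 0; i < m * n; i++) {
        double d = fabs(c[i] - ref[i]);
        if (d > maxdiff) maxdiff = d;
    }
    double tol = 1e-13 * k;                  /* crude bound on accumulated error */
    printf("max abs difference: %g (tol %g) -> %s\n",
           maxdiff, tol, maxdiff < tol ? "PASS" : "FAIL");
    return maxdiff < tol ? 0 : 1;
}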
Okay, I started taking notes here: https://github.com/numpy/numpy/wiki/BLAS-desiderata Please add as appropriate... -n On Sat, Apr 12, 2014 at 12:19 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Apr 11, 2014 at 7:29 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
x86 cpus are backward compatible with almost all instructions they ever introduced, so one machine with the latest instruction set supported is sufficient to test almost everything. For that the runtime kernel selection must be tuneable via the environment so you can use kernels intended for older cpus.
Overriding runtime kernel selection sounds like a good bite-sized feature that could be added to OpenBLAS...
The larger issue is finding a good and thorough testsuite that wasn't written 30 years ago and thus does covers problem sizes larger than a few megabytes. These are the problem sizes are that often crashed openblas in the past. Isn't there a kind of comprehensive BLAS verification testsuite which all BLAS implementations should test against and contribute to available somewhere? E.g. like the POSIX compliance testsuite.
I doubt it! Someone could make a good start on one in an afternoon though. (Only a start, but half a test suite is heck of a lot better than nothing.)
-n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Agree that OpenBLAS is the most favorable route, as opposed to starting from scratch. Btw, why is sparse BLAS not included? I've always been under the impression that scipy sparse supports BLAS - no?
Hi, On Fri, Apr 11, 2014 at 9:03 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Apr 11, 2014 at 12:21 PM, Carl Kleffner <cmkleffner@gmail.com> wrote:
a discussion about OpenBLAS on the octave maintainer list: http://article.gmane.org/gmane.comp.gnu.octave.maintainers/38746
I'm getting the impression that OpenBLAS is being both a tantalizing opportunity and a practical thorn-in-the-side for everyone -- Python, Octave, Julia, R.
How crazy would it be to get together an organized effort to fix this problem -- "OpenBLAS for everyone"? E.g., by collecting patches to fix the bits we don't like (like unhelpful build system defaults), applying more systematic QA, etc. Ideally this could be done upstream, but if upstream is MIA or disagrees about OpenBLAS's goals, then it could be maintained as a kind of "OpenBLAS++" that merges regularly from upstream (compare to [1][2][3] for successful projects handled in this way). If hardware for testing is a problem, then I suspect NumFOCUS would be overjoyed to throw a few kilodollars at buying one instance of each widely-distributed microarchitecture released in the last few years as a test farm...
I think the goal is pretty clear: a modern optionally-multithreaded BLAS under a BSD-like license with a priority on correctness, out-of-the-box functionality (like runtime configurability and feature detection), speed, and portability, in that order.
It sounds like a joint conversation with at least the R, Julia, and Octave teams would be useful here. Anyone volunteer to start that conversation?

Cheers, Matthew
On 11.04.2014 18:03, Nathaniel Smith wrote:
On Fri, Apr 11, 2014 at 12:21 PM, Carl Kleffner <cmkleffner@gmail.com> wrote:
a discussion about OpenBLAS on the octave maintainer list: http://article.gmane.org/gmane.comp.gnu.octave.maintainers/38746
I'm getting the impression that OpenBLAS is being both a tantalizing opportunity and a practical thorn-in-the-side for everyone -- Python, Octave, Julia, R.
Does anyone have experience with BLIS?
https://code.google.com/p/blis/
https://github.com/flame/blis

From the description it looks interesting, and it's BSD licensed, though Windows support is experimental according to the FAQ.
On Fri, Apr 11, 2014 at 7:53 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 11.04.2014 18:03, Nathaniel Smith wrote:
On Fri, Apr 11, 2014 at 12:21 PM, Carl Kleffner <cmkleffner@gmail.com> wrote:
a discussion about OpenBLAS on the octave maintainer list: http://article.gmane.org/gmane.comp.gnu.octave.maintainers/38746
I'm getting the impression that OpenBLAS is being both a tantalizing opportunity and a practical thorn-in-the-side for everyone -- Python, Octave, Julia, R.
does anyone have experience with BLIS? https://code.google.com/p/blis/ https://github.com/flame/blis
Also:

Does BLIS automatically detect my hardware?

Not yet. For now, BLIS requires the user/developer to manually specify an existing configuration that corresponds to the hardware for which to build a BLIS library.

So for now, BLIS is mostly a developer's tool?

Yes. In order to achieve high performance, BLIS requires that hand-coded kernels and micro-kernels be written and referenced in a valid BLIS configuration. These components are usually written by developers and then included within BLIS for use by others. If high performance is not important, then you can always build the reference implementation on any hardware platform. The reference implementation does not contain any machine-specific code and thus should be very portable.

Does BLIS support multithreading?

BLIS does not yet implement multithreaded versions of its operations. However, BLIS can very easily be made thread-safe so that you can call BLIS from threads[...]

Can I build BLIS as a shared library?

The BLIS build system is not yet capable of outputting a shared library. [...]

https://code.google.com/p/blis/wiki/FAQ

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
participants (10)
- Carl Kleffner
- Dinesh Vadhia
- Dr. Michael Lehn
- Julian Taylor
- Matthew Brett
- Matthieu Brucher
- Michael Lehn
- Nathaniel Smith
- Olivier Grisel
- Sturla Molden