[Numpy-discussion] Numpy and MKL, update

Michael Abshoff michael.abshoff at googlemail.com
Thu Nov 13 23:37:05 EST 2008


David Cournapeau wrote:
> On Fri, Nov 14, 2008 at 11:07 AM, Michael Abshoff
> <michael.abshoff at googlemail.com> wrote:
>> David Cournapeau wrote:
>>> On Fri, Nov 14, 2008 at 5:23 AM, frank wang <f.yw at hotmail.com> wrote:
>>>> Hi,
>> Hi,
>>
>>>> Can you provide a working example of building Numpy with MKL on
>>>> Windows and Linux?
>>>> The reason I am thinking of building it myself is that I need the
>>>> speed to match Matlab.
>>> The MKL will only help you for linear algebra, and more specifically
>>> for big matrices. If you build your own ATLAS, you can easily match
>>> Matlab speed in that area, I think.
>> That is pretty much true in my experience for anything but Core2 Intel
>> CPUs, where GotoBLAS and the latest MKL have about a 25% advantage for
>> large problems.
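
(As for frank's original question: a minimal site.cfg sketch for
building numpy against the MKL. The paths below are assumptions for a
recent MKL install on Linux - adjust them to your install, and on
Windows point them at the corresponding MKL directories. Drop the file
next to numpy's setup.py and build as usual with "python setup.py
build".)

    [mkl]
    library_dirs = /opt/intel/mkl/10.0.3.020/lib/em64t
    include_dirs = /opt/intel/mkl/10.0.3.020/include
    mkl_libs = mkl, guide
    lapack_libs = mkl_lapack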
> 
> Note that I never said that ATLAS was faster than MKL/GotoBLAS :) 

:)

> I said you could match Matlab performance (which itself, up to 6.* at
> least, used ATLAS; you could increase Matlab performance by using
> your own ATLAS BTW).

Yes, back in the day I got a threefold speedup for a certain workload
in Matlab by replacing the BLAS and UMFPACK libraries.

> I don't think 25% matters that much, because if
> it does, then you should not use Python anyway in many cases (it
> depends on the kind of problem of course, but I don't think most
> scientific problems reduce to just matrix products/inversions).

Sure, I agree here. A 25% performance difference for dgemm is
significant for some workloads, but if you spend the vast majority of
your time in Python code it won't matter. And sometimes the difference
is much larger than that - see my remarks below.
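
If you want to see whether it matters for your workload, here is a
quick and dirty dgemm timing sketch (the size is arbitrary, numbers
will vary by machine and BLAS):

    import time
    import numpy as np

    n = 2000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    t0 = time.time()
    np.dot(a, b)                  # dispatches to BLAS dgemm
    dt = time.time() - t0
    # a dense matrix product costs about 2*n**3 flops
    print("%.2f GFLOPS" % (2.0 * n**3 / dt / 1e9))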

>> The advantage of the MKL is that one library works more or less
>> optimally on all platforms, i.e. with and without SSE2 for example,
>> since the "right" routines are selected at run time.
> 
> Agreed. As a numpy/scipy developer, I would actually be much more
> interested in work into that direction for ATLAS than trying to get a
> few % of peak speed.

Note that selecting a non-SSE2 version of ATLAS can cause a significant
slowdown. One day not too long ago Ondrej Certik and I were sitting in
IRC in #sage-devel benchmarking some things. His Debian install was a
factor of 12 slower than the same software that he had built with Sage,
and in the end it boiled down to non-SSE2 ATLAS vs. SSE2 ATLAS. That is
a freak case, but I am sure more than enough people will get bitten by
that issue because they installed "ATLAS" in Debian but did not know
about the SSE2 variant.
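
A quick sanity check along those lines: numpy.__config__ shows which
BLAS/LAPACK numpy was built against, and /proc/cpuinfo (Linux only)
tells you whether the CPU advertises SSE2 at all:

    import numpy
    numpy.__config__.show()       # BLAS/LAPACK numpy was built against

    # does the CPU advertise SSE2? (Linux only)
    for line in open("/proc/cpuinfo"):
        if line.startswith("flags"):
            print("sse2" in line.split())
            break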

And a while back someone compared various closed and open source
numerical projects in an article for some renowned Linux magazine,
among them Sage. They ran a bunch of numerical benchmarks, namely FFT
and SVD, and Sage via numpy blew Matlab away by a factor of three for
the SVD (the FFT looked less good because Sage still uses GSL for FFT,
but we will change that). Obviously that was not because numpy was
clever about the SVD routine it used (I know there are several versions
in LAPACK, but the performance difference is usually small), but
because Matlab used some generic build of the BLAS (it was unclear from
the article whether it was MKL or ATLAS) while Sage used a custom-built
SSE2 version. The reviewer expressed admiration for numpy and its
clever SVD implementation - sigh.
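
For anyone who wants to redo that kind of comparison, a trivial SVD
timing in numpy looks like this (the size is arbitrary):

    import time
    import numpy as np

    n = 1000
    a = np.random.rand(n, n)
    t0 = time.time()
    np.linalg.svd(a)              # LAPACK underneath
    print("SVD of a %dx%d matrix: %.2f s" % (n, n, time.time() - t0))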

> Deployment of ATLAS is really difficult ATM, and
> it means that practically, we lose a lot of performance because for
> distribution, you can't tune for every CPU out there, so we just use
> safe defaults. Same for Linux distributions. It is a shame that Apple
> did not open source their Accelerate framework (based on ATLAS, at
> least for the BLAS/LAPACK part), because that's exactly what they did.

Yes, Clint has been in contact with Apple, but never got anything out
of them. Too bad. The new ATLAS release should fix some build problems
around the dreaded timing tolerance issue, and it will also work much
better with threads since Clint rewrote the threading module so that
memory allocation is no longer the bottleneck. He also added native
threading support for Windows, but that has not been tested yet, so
hopefully it will work in a future version. The main issue here is that
for assembly support Clint relies on gcc, which is hardcoded into the
Makefiles; we discussed various options for avoiding that, but so far
no progress can be reported.

> David

Cheers,

Michael




