[Numpy-discussion] parallel numpy (by Brian Granger) - any info?

Tue Jan 8 13:18:20 EST 2008

On Jan 8, 2008 3:33 AM, Matthieu Brucher <matthieu.brucher at gmail.com> wrote:
>
> > I have an AMD processor, so I guess I should use ACML somehow instead.
> > However, first, I would prefer my code to be platform-independent, and
> > second, unfortunately I haven't found in the numpy documentation (on
> > the scipy.org and numpy.scipy.org websites) any mention of how to use
> > numpy multithreading at all (with either MKL or ACML).
>
>
> MKL does the multithreading on its own for level 3 BLAS routines
> (via OpenMP). For ACML, the problem is that AMD does not provide a CBLAS
> interface and is not interested in doing so. With ACML, the compilation
> fails with the current Numpy, but hopefully with Scons it will work, at
> least for the LAPACK part. But I don't think that ACML is parallel.
> I think that using multithreaded libraries is far more interesting and
> easier than using distributed memory systems. This is because
> Python could use some help to enable multi-processing (without the GIL), for
> instance like Java and Jackal. After some reading, I think this means that
> the core Python should be updated.

Definitely, the easiest route on a multicore machine is to find a
library that is already multithreaded.  As long as that library is
"below" the GIL you will see a good speedup.
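
A minimal sketch of what "below the GIL" looks like in practice. It
assumes NumPy is linked against an optimized BLAS; whether the product
actually runs on several cores depends on that library and on its thread
settings (for example the OMP_NUM_THREADS or MKL_NUM_THREADS environment
variables), not on anything in the Python code. The function name
timed_matmul is made up for illustration.

```python
import time

import numpy as np


def timed_matmul(n=1000, seed=0):
    """Multiply two n x n float64 matrices and return (product, seconds)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    start = time.perf_counter()
    # For 2-D float64 arrays this call is handed to the BLAS matrix-multiply
    # routine when NumPy was built with one. Any multithreading happens
    # inside that library.
    c = np.dot(a, b)
    elapsed = time.perf_counter() - start
    return c, elapsed


if __name__ == "__main__":
    c, elapsed = timed_matmul()
    print(f"{c.shape} product in {elapsed:.3f} s")
```

Running the script with different thread settings for the BLAS library
and comparing the timings is a quick way to check whether the installed
library is multithreaded at all.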

But, because of the GIL, distributed memory/message passing
applications in Python are much easier to write than shared
memory/threaded ones.  Especially with packages like mpi4py, message
passing in Python is extremely easy and even pretty fast.  It is also
very simple to call MPI directly from within Pyrex.
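
To give an idea of how little code is involved, here is a rough sketch of
a sum of squares split across MPI processes and combined with a
reduction. It assumes mpi4py and an MPI implementation are installed; the
helper name partial_sum_of_squares and the problem size are invented for
the example.

```python
import numpy as np


def partial_sum_of_squares(n_total, rank, size):
    """Sum of squares over this rank's strided share of range(n_total)."""
    chunk = np.arange(rank, n_total, size, dtype=np.float64)
    return np.array([np.dot(chunk, chunk)])


def main():
    # Imported here so the helper above can be used without MPI installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    local = partial_sum_of_squares(1_000_000, rank, size)
    total = np.zeros(1)
    # Upper-case Reduce works directly on the NumPy buffers.
    comm.Reduce(local, total, op=MPI.SUM, root=0)
    if rank == 0:
        print("sum of squares:", total[0])


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("mpi4py is not installed; install it and run under mpiexec")
```

Saved as, say, sumsq.py (a hypothetical file name), this would be started
with something like "mpiexec -n 4 python sumsq.py". Each process has its
own interpreter and its own GIL, which is why the GIL does not get in the
way.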

>
> > Also, I intended to try numpy multithreading on our icyb cluster
> > (IIRC made of Intel processors) from the world top 500 (however,
> > connections to other sets of processors in other cities have since
> > been set up, and some of those are AMD). If 100-200 processors (I
> > don't remember how many it has) yielded at least a 2x...3x speedup on
> > some of my test cases, it would be a good deal and something to report
> > in my graduation work.

If you have a multiprocessor cluster you really should look at using
mpi4py.  It handles numpy arrays very efficiently and performs
extremely well.  Threading won't help you a bit in this case.

>
> If you have access to Intel Quad-Core processors with the latest MKL and if
> you use matrix multiplications intensively, you will get those results. But
> if you say at your graduation that you used 100 or 200 processors and
> that it only yielded a 2x or 3x speedup, I think the jury will not
> appreciate it.
>
>
> > As my chief informed me, people here are fond of the cluster; mentioning
> > the magical word (in my work) would fill them with respect :)
>
> Then you should start by looking at how to make your algorithms parallel.
> Just throwing processors at the problem will not yield a good speedup per
> processor, and this is what people are looking for: good scalability. Then
> you must use tools like the processing module, MPI, ...

This is always true.  It is shocking how easy it is to write parallel
code that is slow as hell.  It is very difficult to write parallel
code that is fast and scales well.
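
The numbers mentioned above make the point: a 3x speedup on 200
processors is a parallel efficiency of 3/200 = 1.5%. As a rough
illustration, here is a small sketch based on Amdahl's law, which is my
choice of model and not something from the thread. It ignores
communication cost entirely, so real codes usually do worse than this.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Ideal speedup when only parallel_fraction of the runtime scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_procs)


def efficiency(speedup, n_procs):
    """Speedup per processor; 1.0 means perfect scaling."""
    return speedup / n_procs


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        for n in (4, 100, 200):
            s = amdahl_speedup(p, n)
            e = efficiency(s, n)
            print(f"p={p:.2f} n={n:3d} speedup={s:6.2f} efficiency={e:.1%}")
```

Under this model, a code whose runtime is 90% parallelizable can never
exceed a 10x speedup no matter how many processors are added, which is
why the algorithm has to be looked at first.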

Brian

>
>
> Matthieu
> --
> French PhD student
> Website : http://matthieu-brucher.developpez.com/
> Blogs : http://matt.eifelle.com and http://blog.developpez.com/?blog=92
> LinkedIn : http://www.linkedin.com/in/matthieubrucher