[Numpy-discussion] Fwd: GPU Numpy

George Dahl gdahl at cs.toronto.edu
Tue Sep 8 15:19:05 EDT 2009


Sturla Molden <sturla <at> molden.no> writes:

> 
> Erik Tollerud skrev:
> >> Putting NumPy arrays in GPU memory is an easy task. But then I would have
> >> to write the computation in OpenCL's dialect of C99?
> > This is true to some extent, but also probably difficult to do, given
> > that parallelizable algorithms are generally more difficult to
> > formulate in straightforward ways.
> Then you have misunderstood me completely. Creating an ndarray that has
> a buffer in graphics memory is not too difficult, given that graphics
> memory can be memory mapped. This has nothing to do with whether the
> algorithms are parallelizable. It is just memory management. We could
> make an ndarray subclass that quickly puts its content in a buffer
> accessible to the GPU. That is not difficult. But then comes the
> question of what you do with it.
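
(To make the quoted idea concrete: a minimal, hypothetical sketch of such an
ndarray subclass. The GPUArray name is made up, and the bytearray fallback
stands in for memory that would really be obtained from, and mapped by, the
GPU runtime.)

    import numpy as np

    class GPUArray(np.ndarray):
        # Hypothetical ndarray subclass whose storage is a buffer the GPU
        # can see (e.g. mapped or page-locked memory).  NumPy only needs a
        # buffer object; as noted above, this part is just memory management.
        def __new__(cls, shape, dtype=np.float32, buffer=None):
            if buffer is None:
                # Placeholder: a real implementation would request mapped
                # memory from the GPU runtime instead of a plain bytearray.
                nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
                buffer = bytearray(nbytes)
            return np.ndarray.__new__(cls, shape, dtype=dtype, buffer=buffer)

    a = GPUArray((4, 4))
    a[:] = 1.0   # behaves like any other ndarray on the CPU side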
> 
> I think many here misunderstand the issue:
> 
> The teraflops peak performance of modern GPUs is impressive. But NumPy
> cannot easily benefit from it. In fact, there is little or nothing to
> gain from optimising at that end. For a GPU to help, computation must
> be the time-limiting factor. It is not. There is no more to say about
> using GPUs in NumPy right now.
> 
> Take a look at the timings here: http://www.scipy.org/PerformancePython 
> It shows that computing with NumPy is more than ten times slower than 
> using plain C. This is despite NumPy being written in C. The NumPy code 
> does not incur 10 times more floating point operations than the C code. 
> The floating point unit does not run in turtle mode when using NumPy. 
> NumPy's relative slowness compared to C has nothing to do with floating 
> point computation. It is due to inferior memory use (temporary buffers, 
> multiple buffer traversals) and memory access being slow. Moving 
> computation to the GPU can only make this worse.
> 
> Improved memory usage - e.g. through lazy evaluation and JIT compilation
> of expressions - can give up to a tenfold increase in performance. That
> is where we must start optimising to get a faster NumPy. Incidentally,
> this will also make it easier to leverage modern GPUs.
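
(For what it's worth, the numexpr package already does something along these
lines: it compiles a whole expression string and evaluates it in one blocked
pass over the operands, instead of allocating a temporary array per operator.
A rough sketch of the difference:)

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)
    c = np.random.rand(1000000)

    # Plain NumPy: each operator allocates a temporary array and makes
    # another pass over memory, so this line traverses the data many times.
    r1 = 2*a + 3*b - 4*c

    # numexpr evaluates the whole expression in a single blocked pass,
    # with no full-size temporaries.
    r2 = ne.evaluate("2*a + 3*b - 4*c")

    assert np.allclose(r1, r2)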
> 
> Sturla Molden
> 


I know that for my work, I can get around a 50-fold speedup over numpy using
a python wrapper for a simple GPU matrix class.  For example, I might be
dealing with a lot of matrix products where I multiply a fixed 512 by 784
matrix by a 784 by 256 matrix that changes between each product, although to
really see the largest gains I use a 4096 by 2048 matrix times a bunch of
2048 by 256 matrices.  If all I were doing was those matrix products, it
would be even faster, but what I actually do is a matrix product, then add a
column vector to the result, then apply an elementwise logistic sigmoid
function, and potentially generate a matrix of pseudorandom numbers the same
shape as my result (although not always).  For these sorts of workloads, my
python numpy+GPU matrix class goes so much faster than anything that doesn't
use the GPU (be it Matlab, or numpy, or C/C++, whatever) that I don't even
bother measuring the speedups precisely.  In some cases my python code isn't
making many temporaries, since what it is doing is so simple, but in other
cases that is obviously slowing it down a bit.  Relatively complicated jobs
that used to take weeks on the CPU now take hours or days.
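
(For reference, a plain-numpy, CPU-side sketch of one step of the workload
described above, using the larger shapes mentioned.  The last line is only a
guess at how the pseudorandom matrix gets used - e.g. sampling binary states
from the sigmoid output - since that isn't spelled out above.)

    import numpy as np

    rng = np.random.default_rng(0)

    W = rng.standard_normal((4096, 2048)).astype(np.float32)  # fixed matrix
    X = rng.standard_normal((2048, 256)).astype(np.float32)   # changes each step
    b = rng.standard_normal((4096, 1)).astype(np.float32)     # column vector

    Z = W @ X + b                    # matrix product plus broadcast column
    P = 1.0 / (1.0 + np.exp(-Z))     # elementwise logistic sigmoid
    R = rng.random(P.shape, dtype=np.float32)   # pseudorandom matrix, same shape
    S = (R < P).astype(np.float32)   # assumed use: sample binary states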

Obviously improved memory usage would be more broadly helpful, since not
everyone has access to the sorts of GPUs I use, but a tenfold increase in
performance seems like chump change compared to what I see on the sorts of
workloads I run.



