I was wondering if anyone has thought about accelerating NumPy with a GPU. For example nVidia's CUDA SDK provides a feasible way to offload vector math onto the very fast SIMD processors available on the GPU. Currently GPUs primarily support single precision floats and are not IEEE compliant, but still could be useful for some applications.
If there turns out to be a significant speedup over using the CPU, this could be a very accessible way to do scientific and numerical computation using GPUs, much easier than coding directly to the GPU APIs.
Martin
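The single-precision caveat in this post can be gauged on the CPU before any GPU is involved. The following is a minimal sketch, assuming NumPy is installed; the function name, array size, and seed are illustrative choices and do not come from the thread. It computes the same dot product in float64 and in float32 and returns the relative difference.

```python
import numpy as np

def relative_error_single_vs_double(n=100_000, seed=0):
    """Relative error of a float32 dot product against a float64 reference."""
    rng = np.random.default_rng(seed)
    x = rng.random(n)                      # float64 samples in [0, 1)
    y = rng.random(n)
    exact = np.dot(x, y)                   # float64 reference value
    approx = np.dot(x.astype(np.float32),  # same data rounded to float32
                    y.astype(np.float32))
    return float(abs(float(approx) - exact) / exact)

err = relative_error_single_vs_double()
print(f"relative error in single precision: {err:.2e}")
```

Whether an error of this size matters depends on the application, which is the point of the "useful for some applications" qualifier.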
On 5/31/07, Martin Ünsal <martinunsal@gmail.com> wrote:
I was wondering if anyone has thought about accelerating NumPy with a GPU. For example nVidia's CUDA SDK provides a feasible way to offload vector math onto the very fast SIMD processors available on the GPU. Currently GPUs primarily support single precision floats and are not IEEE compliant, but still could be useful for some applications.
I've thought about it, but I think it would be a heck of a lot of work. NumPy works with subarrays a lot and I suspect this would make it tricky to stream through a GPU. Making good use of the several pipelines would also require a certain degree of parallelism which is not there now. We would also need computation of sin, cos, and other functions for ufuncs, so that might not work well.
For ordinary matrix/array arithmetic the shortest route might be a version of ATLAS/BLAS, some of LAPACK, and maybe an FFT library written to use a GPU.
Chuck
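The subarray concern can be made concrete without a GPU. The following is a minimal sketch, assuming NumPy is installed; the helper name as_float_ptr is hypothetical, and no CUDA library is loaded here. A strided view is not one contiguous block of memory, so it has to be copied into a contiguous float32 buffer before a C-style routine (reached through ctypes, for instance) could be handed a plain pointer to it.

```python
import ctypes
import numpy as np

def as_float_ptr(a):
    """Return (buffer, pointer) for a C-contiguous float32 copy of `a`.

    Hypothetical helper: the pointer is what a ctypes-wrapped C routine
    would receive. The buffer must be kept alive while the pointer is used.
    """
    buf = np.ascontiguousarray(a, dtype=np.float32)
    ptr = buf.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    return buf, ptr

x = np.arange(6, dtype=np.float64)[::2]   # strided view: elements 0, 2, 4
print(x.flags["C_CONTIGUOUS"])            # the view is not contiguous

buf, ptr = as_float_ptr(x)
ptr[1] = 10.0                             # write through the pointer, as C code would
print(buf, x)                             # buf changed; x did not, since buf is a copy
```

The copy is the cost being pointed at: results written into the buffer do not reach the original view unless they are copied back.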
Hi Martin,
I was wondering if anyone has thought about accelerating NumPy with a GPU. For example nVidia's CUDA SDK provides a feasible way to offload vector math onto the very fast SIMD processors available on the GPU. Currently GPUs primarily support single precision floats and are not IEEE compliant, but still could be useful for some applications.
I wasn't actually there, but I noticed that last year's SciPy conference page includes a talk entitled "GpuPy: Using GPUs to Accelerate NumPy", by Benjamin Eitzen (I think I also found his Web page via Google):
http://www.scipy.org/SciPy2006/Schedule
I also wondered whether Benjamin or anyone else who is interested had come across the Open Graphics Project (I hadn't got around to asking)?
http://wiki.duskglow.com/tikiindex.php?page=opengraphics
This would be quite a specialized combination, but I'm sure it could be useful to some people with high performance requirements, or maybe for building some kind of special appliance.
Cheers,
James.
Martin Ünsal <martinunsal <at> gmail.com> writes:
I was wondering if anyone has thought about accelerating NumPy with a GPU. For example nVidia's CUDA SDK provides a feasible way to offload vector math onto the very fast SIMD processors available on the GPU. Currently GPUs primarily support single precision floats and are not IEEE compliant, but still could be useful for some applications.
If there turns out to be a significant speedup over using the CPU, this could be a very accessible way to do scientific and numerical computation using GPUs, much easier than coding directly to the GPU APIs.
Martin
I've thought about this too and think that it's a great idea. The existing library Brook, which has a similar programming model to NumPy, proves that it's feasible. And Brook was originally done with OpenGL & DirectX as backends to access the hardware. Needless to say, that's a lot harder than using CUDA.
Since it hasn't already been pointed out, CUDA includes the cuBLAS and cuFFT libraries. I don't know what the status of a LAPACK built on top of cuBLAS is, but I'd be surprised if someone isn't already working on it. Also, NVIDIA has stated that double-precision hardware will be available later this year, in case that's an issue for anyone.
I agree very much that it would make GPUs more accessible, although CUDA has done an amazing job at that already. I think the most helpful thing about this would be if it allowed us to code using the existing array interface from NumPy in a way that the code automatically runs on the GPU in an optimized way, using shared memory and avoiding bank conflicts. I'd happily contribute to such a project if someone else got it started.
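One way to picture "coding against the existing array interface" is to write numerical code against a module argument instead of importing NumPy directly. The following is only a sketch under assumptions: the parameter name xp and the function normalize are hypothetical, and no GPU module is named in this thread, so NumPy itself stands in as the only backend. A GPU-backed module exposing the same functions could in principle be passed in its place.

```python
import numpy as np

def normalize(xp, a):
    """Scale `a` to unit Euclidean length using only calls on module `xp`."""
    return a / xp.sqrt(xp.sum(a * a))

# NumPy is the stand-in backend; a hypothetical GPU array module with the
# same sqrt/sum functions and array arithmetic would be passed as `xp`.
v = normalize(np, np.array([3.0, 4.0], dtype=np.float32))
print(v, v.dtype)
```

This covers only the dispatch side. The optimizations mentioned above (shared memory, bank conflicts) would live inside whatever backend is supplied.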
This is very much worth pursuing. I have been working on things related to this on and off at my day job. I can't say specifically what I have been doing, but I can make some general comments:
* It is very easy to wrap the different parts of CUDA using ctypes and call them from NumPy.
* Compared to a recent fast Intel CPU, the speedups we see are consistent with what the NVIDIA literature reports: 10-30x is common and in some cases we have seen up to 170x.
* Certain parts of NumPy will be very easy to accelerate: things covered by BLAS, FFTs, ufuncs, and random variates, but each of these will have very different speedups.
* LAPACK will be tough, extremely tough in some cases. The main issue is that various algorithms in LAPACK rely on different levels of BLAS (1, 2, or 3). The algorithms in LAPACK that primarily use level 1 BLAS functions (vector operations), like LU decomposition, are probably not worth porting to the GPU, at least not using the BLAS that NVIDIA provides. On the other hand, the algorithms that use more of the level 2 and 3 BLAS functions are probably worth looking at.
* NVIDIA made a design decision in its implementation of cuBLAS and cuFFT that is somewhat detrimental for certain algorithms. In their implementation, the BLAS and FFT routines can _only_ be called from the CPU, not from code running on the GPU. Thus if you have an algorithm that makes many calls to cuBLAS/cuFFT, you pay a large overhead in having to keep the main flow of the algorithm on the CPU. It is not uncommon for this overhead to completely erode any speedup you may have gotten on the GPU.
* For many BLAS calls, cuBLAS won't be much faster than a good optimized BLAS from ATLAS or Goto.
Brian
On 5/31/07, Martin Ünsal <martinunsal@gmail.com> wrote:
I was wondering if anyone has thought about accelerating NumPy with a GPU. For example nVidia's CUDA SDK provides a feasible way to offload vector math onto the very fast SIMD processors available on the GPU. Currently GPUs primarily support single precision floats and are not IEEE compliant, but still could be useful for some applications.
If there turns out to be a significant speedup over using the CPU, this could be a very accessible way to do scientific and numerical computation using GPUs, much easier than coding directly to the GPU APIs.
Martin
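Brian's point about per-call overhead eroding the speedup can be put into a small cost model. The following is a rough sketch: the function name and every number in it are illustrative assumptions (a 20x kernel speedup taken from within the 10-30x range quoted above, and a made-up fixed cost per library call), not measurements.

```python
def overall_speedup(cpu_time, kernel_speedup, n_calls, per_call_overhead):
    """Speedup of a GPU run when each library call pays a fixed CPU-side cost.

    All arguments are in the same time unit; the model ignores everything
    except raw compute time and a constant overhead per call.
    """
    gpu_time = cpu_time / kernel_speedup + n_calls * per_call_overhead
    return cpu_time / gpu_time

# Same total work (1.0 time unit on the CPU), split into few or many calls.
few_calls = overall_speedup(1.0, kernel_speedup=20.0,
                            n_calls=10, per_call_overhead=1e-4)
many_calls = overall_speedup(1.0, kernel_speedup=20.0,
                             n_calls=10_000, per_call_overhead=1e-4)
print(f"few calls:  {few_calls:.2f}x")
print(f"many calls: {many_calls:.2f}x")
```

Under these assumed numbers, ten large calls keep nearly the full 20x, while ten thousand small calls end up slightly slower than the CPU, which matches the qualitative behaviour described in the message.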
participants (5)

Andrew Corrigan

Brian Granger

Charles R Harris

James Turner

Martin Ünsal