[Numpy-discussion] Accelerating NumPy computations [Was: GPU Numpy]

Fri Aug 21 15:46:01 EDT 2009

On Fri, Aug 21, 2009 at 2:51 PM, Matthew Brett<matthew.brett at gmail.com> wrote:
> I can imagine Numpy being useful for scripting in this
> C-and-assembler-centric world, making it easier to write automated
> testers, or even generate C code.
>
> Is anyone out there working on this kind of stuff?  I ask only because
> there seems to be considerable interest here on the Berkeley campus.
>
> Best,
>
> Matthew

Frederic Bastien and I are working on this sort of thing.  We use a
project called Theano to build symbolic expression graphs.  Theano
optimizes those graphs the way an optimizing compiler would, and then
generates C code for them.  We haven't put much effort into optimizing
the C implementations of most expressions (except for non-separable
convolution), but we call fast BLAS and FFTW functions, and our naive
implementations are typically faster than the equivalent NumPy
expressions just because they are in C.  (Although congrats to those
working on optimizing NumPy... it has gotten a lot faster over the
last few years!)
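
As a rough illustration of the graph-to-C idea described above, here is
a toy sketch in plain Python.  It is not Theano's API or its code
generator: the names Var, Op and generate_c are hypothetical, there is
no optimization pass, and only element-wise + and * on double vectors
are handled.

```python
# Toy sketch only: Var, Op and generate_c are hypothetical names,
# not part of Theano.

class Var:
    """A named input vector in the symbolic graph."""

    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return Op("+", self, other)

    def __mul__(self, other):
        return Op("*", self, other)

    def c_expr(self):
        # C expression for element i of this input.
        return f"{self.name}[i]"

    def inputs(self):
        return [self.name]


class Op(Var):
    """A binary element-wise operation on two sub-graphs."""

    def __init__(self, symbol, left, right):
        self.symbol, self.left, self.right = symbol, left, right

    def c_expr(self):
        return f"({self.left.c_expr()} {self.symbol} {self.right.c_expr()})"

    def inputs(self):
        # Collect input names in first-use order, without duplicates.
        names = []
        for name in self.left.inputs() + self.right.inputs():
            if name not in names:
                names.append(name)
        return names


def generate_c(expr, func_name="fused"):
    """Emit one C loop that evaluates the whole graph per element."""
    args = ", ".join(f"const double *{n}" for n in expr.inputs())
    return (
        f"void {func_name}(int n, {args}, double *out) {{\n"
        f"    for (int i = 0; i < n; ++i)\n"
        f"        out[i] = {expr.c_expr()};\n"
        f"}}\n"
    )


x, y = Var("x"), Var("y")
c_source = generate_c(x * y + x)
print(c_source)
```

The whole expression ends up in a single loop body, so no temporary
array is needed for x * y.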

We are now writing another backend that generates CUDA runtime C++.
It is just like you say: even for simple tasks like adding two vectors
together or summing the elements of a matrix, there are several
possible kernels that can be optimal in different circumstances.  The
penalty of choosing a sub-optimal kernel can be pretty high.  So what
ends up happening is that even for simple ufunc-type expressions, we
have
- a version for when the arguments are small and everything is C-contiguous
- a general version that is typically orders of magnitude slower than
the optimal choice
- versions for when arguments are small and 1D, 2D, 3D, 4D, 5D
- versions for when some of the arguments are broadcast in different ways
- versions for when there is at least one large contiguous dimension
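
One way to picture the selection among such variants is a small
dispatcher that inspects shapes and strides.  The sketch below is only
an assumption about how this could look: the kernel names, the SMALL
and LARGE_DIM thresholds, and the order of the checks are all
hypothetical and are not taken from the actual backend.

```python
# Hypothetical thresholds and kernel names, for illustration only.
SMALL = 1 << 16      # total element count treated as "small"
LARGE_DIM = 1024     # length at which one dimension counts as "large"


def is_c_contiguous(shape, strides):
    """True if the element strides describe a dense row-major layout."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


def num_elements(shape):
    total = 1
    for dim in shape:
        total *= dim
    return total


def choose_kernel(args):
    """Pick a kernel variant.

    args is a list of (shape, strides) pairs, one per argument, with
    strides counted in elements and every shape already broadcast to
    the common output shape (a broadcast axis has stride 0).
    """
    shape = args[0][0]
    n = num_elements(shape)
    ndim = len(shape)

    if n <= SMALL and all(is_c_contiguous(s, st) for s, st in args):
        return "small_contiguous"
    if any(d > 1 and st == 0 for s, sts in args for d, st in zip(s, sts)):
        return "broadcast"
    if n <= SMALL and 1 <= ndim <= 5:
        return f"small_{ndim}d"
    for axis in range(ndim):
        if shape[axis] >= LARGE_DIM and all(sts[axis] == 1 for _, sts in args):
            return "large_contiguous_dim"
    return "general"


dense = ((4, 4), (4, 1))
print(choose_kernel([dense, dense]))
print(choose_kernel([dense, ((4, 4), (0, 1))]))
```

Anything that matches none of the specialized cases falls through to
the slow general version.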

And the list goes on.  We are still in the process of understanding
the architecture and the most effective strategies for optimization.
I think our design is a good one from the user's perspective, though,
because it supports a completely opaque front-end: you just program
the symbolic graph in Python using normal expressions, compile it as a
function, and call it.  The detail of whether it is evaluated on the
CPU or the GPU (or both) is hidden.
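
The build / compile / call workflow can be sketched with a toy front-end
like the one below.  Everything in it is hypothetical (Sym, function,
BACKENDS, gpu_available) and it is not Theano's API.  The "gpu" entry is
a stand-in that reuses the CPU path, since the point here is only that
the caller never names a backend.

```python
# Toy sketch: all names here are hypothetical, not Theano's API.
import operator


class Sym:
    """A node in a symbolic expression graph."""

    def __init__(self, name=None, fn=None, args=()):
        self.name, self.fn, self.args = name, fn, args

    def __add__(self, other):
        return Sym(fn=operator.add, args=(self, other))

    def __mul__(self, other):
        return Sym(fn=operator.mul, args=(self, other))


def evaluate(node, env):
    """Evaluate the graph for one element, given input values by name."""
    if node.fn is None:
        return env[node.name]
    return node.fn(*(evaluate(arg, env) for arg in node.args))


def cpu_backend(inputs, output):
    names = [v.name for v in inputs]

    def run(*vectors):
        # Apply the graph element by element across the input vectors.
        return [evaluate(output, dict(zip(names, row))) for row in zip(*vectors)]

    return run


# Stand-in registry: a real system would plug a GPU code generator in
# under "gpu"; here it reuses the CPU path.
BACKENDS = {"cpu": cpu_backend, "gpu": cpu_backend}


def gpu_available():
    return False  # placeholder probe; assumption for this sketch


def function(inputs, output):
    """Compile a graph into a callable; the backend choice stays internal."""
    backend = "gpu" if gpu_available() else "cpu"
    return BACKENDS[backend](inputs, output)


x, y = Sym("x"), Sym("y")
f = function([x, y], x * y + x)
result = f([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(result)
```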

If anyone is interested in what we're doing please feel free to send
me an email.  Links to these projects:

http://www.pylearn.org/theano
http://code.google.com/p/theano-cuda-ndarray/
http://code.google.com/p/cuda-ndarray/

James
-- 
http://www-etud.iro.umontreal.ca/~bergstrj